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We  consider  different  reconfigurable  meshes  with  buses  architectures  such  as 
PARBUS,  RMESH  and  MRN  for  parallel  computing.  We  compare  reconfigurable 
meshes  with  buses  with  other  parallel  computing  models  such  as  PRAM,  fixed  topology 
interconnection  networks  and  meshes  with  static  buses. 

We  consider  sorting  n  numbers  onnxn  reconfigurable  meshes  with  buses.  We 
develop  simple  and  optimal  areaxtime2  complexity  algorithms  for  all  reconfigurable 
meshes  with  buses  architectures  (PARBUS/RMESH/MRN).  We  extend  our  algorithms  to 
higher  dimension  reconfigurable  meshes  with  buses.  We  show  that  it  is  faster  to  sort  than 
to  select  the  £th  element  using  Lin's  algorithm  on  a  PARBUS. 
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We  show  that  the  lower  bound  for  the  number  of  routes  needed  to  sort  n  numbers 
on  an  n  x  n  bidirectional  mesh  established  by  Schnorr  and  Shamir  is  incorrect.  We  pro- 
vide algorithms  that  are  able  to  sort  using  a  number  of  routes  that  is  very  close  to  the  dis- 
tance lower  bound  for  bidirectional  as  well  as  strict  unidirectional  meshes.  We  point  out 
the  practical  advantages  of  our  algorithms  when  applied  to  mesh  connected  computers 
with  tR  »  2  tc.  We  show  how  to  simulate  the  mesh  algorithms  on  reconfigurable 
meshes  with  buses. 

We  develop  0(1)  time  algorithms  to  determine  the  convex  hull  of  N  planar  points, 
find  the  smallest  enclosing  rectangle  of  a  set  of  N  planar  points,  triangulate  a  set  of  N 
planar  points,  and  for  the  ECDF  search  problem  for  N  planar  points  on  N  x  N 
reconfigurable  meshes  with  buses.  We  also  develop  an  O(l)  time  algorithm  to  determine 
the  voronoi  diagram  of  N  planar  points  on  N3  x  N  reconfigurable  meshes  with  buses. 
We  show  that  Reisis's  algorithm  for  determining  the  convex  hull  of  N  planar  points  is 
incorrect.  Our  algorithms  work  on  all  N  x  N  reconfigurable  meshes  with  buses 
(PARBUS/RMESH/MRN). 
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CHAPTER  1 
MESH  CONNECTED  PARALLEL  COMPUTER  ARCHITECTURES 

Introduction 

Most  control-flow  based  parallel  computational  models  can  be  classified  into  either 
Shared  Memory  Model  (SMM)  or  Distributed  Memory  Model  (DMM).  In  SMM,  all  pro- 
cessor elements  (PEs)  share  a  global  memory  and  communicate  with  each  other  via 
shared  memory.  In  SMM,  each  PE  has  its  own  local  control  and  local  memory.  Each  pro- 
cessor is  allowed  to  access  an  arbitrary  location  in  shared  memory  based  on  allowed 
read/write  modes.  In  an  exclusive-read,  exclusive-write  (EREW)  model,  at  most  one  PE 
can  read  from  or  write  to  a  location  in  the  shared  memory  in  a  cycle.  In  a  concurrent- 
read,  exclusive-write  (CREW)  model,  multiple  PEs  can  read  from  the  same  memory 
location,  but  only  one  PE  can  write  to  a  location  in  the  shared  memory.  In  concurrent- 
write  (CW),  multiple  write  to  a  single  location  in  memory  is  allowed.  The  value  written 
is  resolved  based  on  a  write  conflict  arbitration  scheme.  PRAM  (Parallel  Random  Access 
Memory)  characterizes  the  SMM.  Karp  [KARP90]  lists  parallel  algorithms  for  many 
applications  on  SMM. 

There  is  no  global  memory  to  share  in  a  DMM  and  the  interprocessor  communica- 
tion is  done  via  an  interconnection  network.  In  this  model  each  PE  has  its  own  local 


-1- 


-2- 

memory  and/or  blocks  of  distributed  memory  modules  available  to  PEs  through  an  align- 
ment network  connecting  PEs  with  memory  modules.  In  DMM,  there  is  a  delay  associ- 
ated with  remote  memory  access  through  a  network.  The  interprocessor  network  is  usu- 
ally fixed  at  design  phase.  All  the  data  routing  and  communication  between  PEs  is  depen- 
dent on  the  capabilities  of  the  network.  The  computation  time  on  a  DMM  is  dependent  on 
the  diameter  of  the  communication  network,  i.e.,  the  maximum  time  to  communicate 
and/or  route  data  between  any  two  PEs. 

There  are  various  architectural  classification  schemes  for  parallel  computational 
models  [HWAN84].  Flynn's  Classification  [FLYN66]  is  based  on  multiplicity  of  instruc- 
tion streams  and  data  streams  on  a  multiprocessor  system.  Figure  1.1  gives  a  block 
diagram  for  SIMD  (Single  Instruction  Multiple  Data)  mutiprocessor  computer  architec- 
ture. While  many  interconnection  networks  [HWAN84],  [FENG81],  [ULLM84]  have 
been  proposed,  the  two-dimensional  mesh,  the  hypercube,  mesh  of  trees,  and  the  pyramid 
have  emerged  as  the  dominant  ones  for  many  practical  applications.  Figures  1.2  -  1.7 
give  examples  of  these  interconnection  networks. 

Mesh  Connected  Computer 

In  Mesh  Connected  Computer  (MCC)  model,  the  PEs  can  be  thought  of  as  logically 
arranged  in  a  ^-dimensional  array  /!("*- 1>  h*_2,—»  /io)»  where  n,-  is  the  size  of  the  j'th 
dimension  and  the  number  of  PEs  p  =  /i*_i*/z*_2*  •••*«()•  The  PE  at  location 
A(ik_l,...,i0)is  connected  to  the  PEs  at  locations  A(i/C_1,...,ij±  l,...,i'o)»  0  <  j  <  k,  pro- 
vided they  exist.  Each  PE  is  connected  to  its  2k  neighbors.  It  is  assumed  that  a  pE  may 
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Figure  1.1  Block  diagram  of  an  SIMD  Computer 
transmit  data  to  any  of  its  nearest  neighbors  in  unit  time. 

Data  may  be  transmitted  from  one  PE  to  another  PE  via  this  interconnection  pattern. 
The  interconnection  scheme  for  a  16  PE  MCC  with  k  =  2  is  given  in  Figure  1.2.  Some 
simple  variations  to  the  mesh  include  end-around  connections.  Figure  1.3  shows  an  MCC 
with  wrap-around  connections,  the  right  end  of  each  row  connects  to  the  left  end  of  the 
row  and  the  bottom  of  each  column  connects  to  its  top.  Figure  1.4  shows  an  MCC  with 
toroidal  connection.  The  right  end  of  a  row  connects  to  the  left  end  of  the  next  row  and 
the  bottom  of  a  column  connects  to  the  top  of  the  next  column.  The  ILLIAC-IV,  MPP, 
and  DAP  are  some  examples  of  mesh  connected  computers. 


The  regular  structure  and  nearest  neighbor  (local  connectivity)  connection  makes 
MCC  suitable  for  VLSI  implementation.  The  fraction  of  the  area  occupied  by  intercon- 
nect wires  in  a  MCC  is  fixed  regardless  of  the  size  N.  MCC  seems  to  be  a  natural  struc- 
ture for  solving  many  problems  in  matrix  computation  and  image  processing.  In  parallel 
and  distributed  processing,  the  computations  are  usually  constrained  by  information  flow 
and  data  movement  operations  rather  than  processing  time  at  PEs.  MCC  is  not  suitable 
for  computations  like  sorting  where  a  single  piece  of  data  has  to  be  moved  from  one 
corner  to  the  opposite  corner.  In  a  N1'2  x  N1'2  2-MCC,  the  worst  case  data  movement 
may  take  as  much  as  Q(Nl/2)  time.  This  deficiency,  the  large  diameter,  is  the  main 
shortcoming  of  MCC's.  Most  problems  involving  communications  over  long  distance 
will  have  a  lower  bound  on  computation  time  at  least  equal  to  the  diameter  of  the  mesh. 
Dekel  [DEKE81]  has  parallel  algorithms  for  matrix  and  graph  problems  on  MCC  and 
other  models.  The  studies  [BAUD78],  [KUMA83],  [MARB88],  [NASS79],  [SCHE86], 
[SCHER89],  [THOMK77],  [PARK87,90]  have  sorting  algorithms  on  MCC  and 
[NASS80]  has  an  optimal  routing  algorithm  on  MCC.  The  studies  [HWAN84], 
[MILLS84a,88de,89],  [STOU82],  and  [LEIG92]  have  listed  MCC  algorithms  for  many 
applications. 
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Figure  1.2  A  4  x  4  Mesh  Connected  Computer 


Mesh  of  Trees 


Mesh  of  Trees  (MOT)  is  a  hybrid  network  based  on  meshes  and  trees.  MOT  has  a 
smaller  diameter  and  is  faster  than  a  MCC  for  the  same  number  of  nodes.  MOT  is  area 
efficient  in  terms  of  VLSI  complexity  [THOM79,80],  [ULLM84].  MOT  is  defined  in 
[LEIG81],  [LEIG84],  [CAPE83],  [NATH83].  The  N  x  N  mesh  of  trees  is  constructed 
from  an  N  x  N  grid  of  processors  by  adding  processors  and  wires  to  form  a  complete 
binary  tree  in  each  row  and  each  column.  The  leaves  of  the  trees  are  the  N2  grid  nodes 
and  added  nodes  are  the  internal  nodes  of  the  trees.  The  network  has  3N2-2N  proces- 
sors. The  leaf  and  the  root  processors  have  degree  two  and  all  others  have  degree  3.  The 
N  x  N  MOT  has  diameter  4log2(N).  Figure  1.5  shows  a  4  x  4  MOT. 
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Figure  1.3  Mesh  with  wrap-around  connections 


Figure  1.4  Mesh  with  toroidal  connections 
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The  two-dimensional  mesh  of  trees  has  a  very  natural  and  regular  structure.  There 
are  variations  to  MOT  such  as  an  additional  sets  of  trees  along  the  diagonal  of  the  mesh, 
connecting  original  grid  nodes  with  mesh  edges.  A  reduced  mesh  of  trees  has  1/3  as 
many  processors  as  the  usual  mesh  trees.  Mesh  of  Trees  can  also  be  defined  in  higher 
dimensions.  MOT  offer  dramatic  speedup  in  time  over  meshes  for  the  same  number  of 
processors.  Many  problems  can  be  solved  in  polylogarithmic  time  with  polynomial  size 
MOT  (Class  NC).  MOT  is  area  universal,  i.e.,  it  can  simulate  any  other  network  with  the 
same  VLSI  wire  area  with  only  a  polylogarithmic  factor  slow  down.  Nath  et  al. 
[NATH83]  have  area  efficient  algorithms  for  matrix  computation,  sorting,  FFT  and 
graph  problems.  Other  studies  [ALNU86],  [ALNU87],  [ALNU88],  [PRAS86]  describe 
mesh  of  trees  algorithms  for  image  processing,  graph  problems,  geometric  problems. 
Leighton  [LEIG92]  has  listed  mesh  of  trees  algorithms  for  many  applications. 

Pyramid  Computer 

The  Pyramid  Computer  (PC)  is  a  combination  of  tree  and  mesh  structures.  A 
pyramid  computer  of  size  N  has  an  Nin  x  N112  mesh  connected  computer  as  its  base 
(level  0),  and  log4(A0  levels  of  mesh  connected  computers  above.  A  PE  at  level  k  is  con- 
nected to  4  siblings  at  level  k,  4  children  at  level  k  - 1  and  a  parent  at  level  k  + 1.  Figure 
1.6  shows  a  PC  with  16  PEs  at  the  base.  The  apex  is  the  topmost  level  and  has  exactly 
one  PE.  If  some  level  of  PC  is  a  2*  x  2*  mesh,  then  the  level  above  it  is  a  2*~ 1  x  2k~ 1 
mesh  and  the  level  below  is  a  2*+1  x  2k  +  l  mesh.  Let  (i,j,k)  represent  the  PE  in  posi- 
tion   (i,j)    at    level    k,    then    it    is    connected    to    PEs    (i±l,j,k),    (i,j±l,k), 
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CJ        Original  mesh  nodes 

I I        Column  tree  nodes 

/n      Row  tree  nodes 


Figure  1.5  A  4  x  4  Mesh  of  Trees 


(i  div  2,j  div  2,k  +  l),  (2i,2j,k-l),  (2i  +  l,2j,k-l),  (2i,2j+l,k-l),  and 
(2/  + 1 ,2;+ 1  ,fc  - 1 ),  if  they  exist.  Cantoni  and  Levialdi  [CANT86]  have  proposed  varia- 
tions of  PC  for  computer  vision  applications. 

A  PC  with  a  base  size  N2  has  - — - — -  PEs  and  has  a  diameter  of  the  O(log2(A0)- 

PC  is  not  as  powerful  as  MOT.  There  are  many  problems  which  admit  algorithms  in  class 
NC  on  a  JV2-MOT  but  require  V^  steps  on  N  x  N  PC.  The  studies  [TANI82],  [TANI84], 
[MILLS 84b,87a],  [CHAN88]  have  algorithms  for  image  processing,  data  movement  and 


geometric  problems  on  PC. 


Base(l,l) 


Apex 


(4,4) 


Figure  1.6  Pyramid  Computer  with  base  16  (42)  nodes 


Hypercube 

A  ^-dimensional  hypercube  consists  of  N  =  2*  nodes.  The  nodes  are  labeled 
0,1,...,2*-1.  Two  nodes  in  a  hypercube  are  connected  if  their  labels  differ  in  exactly  one 
bit  position  in  binary  representation.  The  degree  of  each  node  is  log2(A0-  The  diameter  of 
an  N  PE  hypercube  is  also  \og2(N).  Figure  1.8  shows  an  8  PE  hypercube.  The  hypercube 
has  become  a  popular  architecture  because  of  its  versatility  and  efficiency.  It  can  be  used 
both  for  general  and  special  purpose  applications.  It  can  efficiently  simulate  any  other 
network  of  the  same  size.  It  can  simulate  meshes,  binary  trees  and  mesh  of  trees 
efficiently.  Ncube  and  iPSC  are  examples  of  commercially  available  hypercubes. 
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Since  the  degree  of  a  node  in  a  hypercube  is  log2(A0.  the  number  of  connections  per 
PE  grows  logarithmically  with  the  size  N  of  the  network.  This  can  be  a  problem  for  large 
size  hypercubes  during  physical  layout  for  VLSI  implementation.  There  are  bounded- 
degree  derivatives  of  hypercube  such  as  butterfly,  shuffle-exchange  graphs,  cube- 
connected-cycles,  Benes  network,  de  Bruijn  graphs  [ULLM84].  [RANK91],  [LEIG92], 
[MTLLS87c],  [STOJ86]  have  listed  hypercube  algorithms  for  many  application  areas 


including  image  processing  and  computational  geometry. 
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Figure  1.7  Hypercube  with  8  (23)  nodes 


Modified  MCC 


Several  modifications  to  MCC  have  been  proposed  to  reduce  the  worst  case  PE-to- 
PE  communication  time.  These  include  attaching  trees  (MOT),  embedding  one  or  more 
global  meshes  within  MCC  architecture,  pyramid  computer,  attaching  buses  to  proces- 
sors. Carlson  [CARL85],  [CARL86]  explores  embedding  one  or  more  global  meshes  to 
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MCC.  This  modification  improves  asymptotic  time  complexity  for  computations  with 
low  to  medium  interprocessor  communications  such  as  tree  computations,  prefix  compu- 
tations, finding  connected  components  of  a  graph.  There  is  no  improvement  for  computa- 
tions requiring  high  interprocessor  communications  such  as  sorting  and  computing  a 
Fourier  transform. 

Another  modification  suggested  to  MCC  is  adding  one  or  more  broadcast  buses.  A 
broadcast  bus  connects  PEs  together  so  that  they  can  transmit  data  to  all  other  attached 
PEs  in  unit  time.  It  is  assumed  that  only  one  PE  can  broadcast  at  any  time.  Ableson 
[ABEL80]  and  Gentleman  [GENT78]  have  analyzed  problems  where  the  worst  case 
solution  is  constrained  by  data  movement.  Jordan  [JORD78,79],  and  Gentleman 
[GENT78]  were  first  to  consider  supplementing  MCC  with  a  global  bus.  Bhokari 
[BOKH81,84],  Stout  [STOU83]  consider  finding  the  maximum,  semigroup  operations 
and  finding  median  on  MCC  with  a  single  bus.  They  report  asymptotic  improvement  in 
time  complexity  over  MCC.  There  is  no  improvement  for  computations  requiring  high 
interprocessor  communications  such  as  sorting. 

Prasanna  Kumar  [PRAS87ab],  [PRAS89],  Stout  [STOU86],  and  Chen 
[CHEN89,90]  consider  augmenting  MCC  with  a  bus  for  each  row  and  column.  Figure  1.8 
shows  such  a  MCC  with  multiple  buses.  Aggarwal  [AGGA86]  considers  augmenting  d- 
dimension  MCC  with  k  global  buses  for  finding  the  maximum  and  matrix  multiplication. 
The  studies  [PRAS87b],  [PRAS89]  and  [CHEN89.90]  consider  semigroup  computations, 
linear  algebra,  numerical  computations,  image  processing,  and  geometric  problems  on 
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MCC  with  multiple  buses.  They  obtain  improvement  in  asymptotic  time  complexity  over 
MCC  with  a  single  bus.  In  all  these  models  the  bus  configurations  are  static,  i.e.,  it  is 
fixed  over  computations. 


Row  Buses 


Column  Buses 


Figure  1.8  A  4  x  4  MCC  with  Row  and  Column  Broadcasting 


Reconfigurable  Meshes  with  Buses 


Several  authors  have  considered  various  reconfigurable  parallel  computational 
models  and  architectures.  These  include  the  Bus  Automata  (BA)  of  Rothstein  [ROTH76], 
the  Configurable  Highly  Parallel  Computer  (CHiP)  of  Synder  [SYND82],  the 
Reconfigurable  Meshes  with  buses  (RMESH)  of  Miller  et  al.  [MILLP87b,88abc],  the 
Content  Addressable  Array  Processor  (CAAP)  of  Weems  et  al.  [WEEM87,89],  the 
polymorphic  torus  of  Li  and  Maresca  [LI89ab,  MARE89],  the  processor  array  with  a 
reconfigurable  bus  system  (PARBUS)  of  Wang  and  Chen  [WANG90ab],  and  the 


reconfigurable  network  (RN)  of  Ben-Asher  et  al.  [BEN91]. 
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CHiP  [SYND82]  consists  of  homogeneous  PEs  located  at  grid  points  and  inter- 
posed with  programmable  switches.  Each  PE  and  switch  has  local  memory.  PE  store  pro- 
gram and  data  while  each  switch  stores  a  fixed  number  of  allowed  interconnection  pat- 
terns. The  switches  are  under  a  central  global  control  of  the  program  running  the 
machine.  Various  interconnection  between  processors  are  accomplished  by  using  dif- 
ferent stored  patterns.  CHiP  architecture  differs  from  other  reconfigurable  meshes  in  two 
ways.  First,  the  number  of  connection  pattern  between  PEs  is  constrained  by  the  limited 
switch  settings  stored  in  the  local  memory.  Second,  the  switch  settings  are  under  a  central 
control  rather  than  under  distributed  control  of  PEs.  Several  issues  related  to  Wafer 
Scale  Integration  of  CHiP  are  discussed  in  [HEDL82].  Gannon  [GANN81]  discusses 
solutions  to  numerical  and  signal  processing  problems  on  CHiP. 

Bus  Automata  (BA)  [ROTH76]  can  be  considered  as  a  cellular  automata  (arrays  of 
automata  with  interconnections  between  any  one  automata  and  a  fiexd  bounded  number 
of  neighbors)  with  a  locally  switchable  global  communication  network.  Each  cell  (PE) 
can  connect  a  subset  of  its  input  lines  (finks)  directly  to  some  subset  of  its  output  links 
through  internal  channels.  By  alternating  links  and  channels  "buses"  are  set  up  which 
can,  in  principle,  establish  direct  communication  from  any  particular  cell  to  any  desired 
subset  of  cells.  Communication  on  a  bus  is  assumed  to  take  unit  time.  The  works  of 
Rothstein  et.  al.  [ROTH76,77ab,78,79],  [MOSH79],  [CHAM87]  focus  on  developing 
0(1)  algorithms  for  problems  in  the  area  of  parsing  and  formal  languagesa,  pattern  recog- 
nition, neural  modeling,  cryptographic  methods. 
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Other  mesh-like  architectures  with  reconfigurable  buses  are  the  content  addressable 
array  processor  (CAAP)  of  Weems  et  al.  [WEEM87.89],  the  polymorphic  torus  of  Li  and 
Maresca  [LI89,  MARE89],  the  reconfigurable  mesh  with  buses  (RMESH)  of  Miller  et  al. 
[MILLP87b,  88abc],  the  processor  array  with  a  reconfigurable  bus  system  (PARBUS)  of 
Wang  and  Chen  [WANG90ab],  and  the  reconfigurable  network  (RN)  of  Ben-Asher  et  al. 
[BENA91].  We  refer  to  all  these  models  as  Reconfigurable  Meshes  with  Buses  (RMBs). 

The  CAAP  [WEEM87,89]  and  RMESH  [MILLP87b,  MILLP88abc]  architectures 
appear  to  be  quite  similar.  So,  we  shall  describe  the  RMESH  only.  In  this,  we  have  a  bus 
grid  with  annxn  arrangement  of  processors  at  the  grid  points  (see  Figure  1.9  for  a  4x4 
RMESH  ).  Each  grid  segment  has  a  switch  on  it  which  enables  one  to  break  the  bus,  if 
desired,  at  that  point.  When  all  switches  are  closed,  all  n2  processors  are  connected  by 
the  grid  bus.  The  switches  around  a  processor  can  be  set  by  using  local  information.  If  all 
processors  disconnect  the  switch  on  their  north,  then  we  obtain  row  buses  (Figure  1.10). 
Column  buses  are  obtained  by  having  each  processor  disconnect  the  switch  on  its  east 
(Figure  1.11).  In  the  exclusive  write  model  two  processors  that  are  on  the  same  bus  can- 
not simultaneously  write  to  that  bus.  In  the  concurrent  write  model  several  processors 
may  simultaneously  write  to  the  same  bus.  Rules  are  provided  to  determine  which  of  the 
several  writers  actually  succeeds  (e.g.,  arbitrary,  maximum,  exclusive  or,  etc.).  Notice 
that  in  the  RMESH  model  it  is  not  possible  to  simultaneously  have  n  disjoint  row  buses 
and  n  disjoint  column  buses  that,  respectively,  span  the  width  and  height  of  the  RMESH. 
It  is  assumed  that  processors  on  the  same  bus  can  communicate  in  O(l)  time.  RMESH 
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algorithms  for  fundamental  data  movement  operations  and  image  processing  problems 
can  be  found  in  [MILLP88abc,  MILLP91ab,  JENQ91abc]. 

An  nx  n  PARBUS  (Figure  1.12)  [WANG90ab]  is  an  n  x  n  mesh  in  which  the 
interprocessor  links  are  bus  segments  and  each  processor  has  the  ability  to  connect 
together  arbitrary  subsets  of  the  four  bus  segments  that  connect  to  it.  Bus  segments  that 
get  so  connected  behave  like  a  single  bus.  The  bus  segment  interconnections  at  a  proc- 
cessor  are  done  by  an  internal  four-port  switch.  If  the  upto  four  bus  segments  at  a  proces- 
sor are  labeled  N  (North),  E  (East),  W  (West),  and  S  (South),  then  this  switch  is  able  to 
realize  any  set,  A  =  [Al,A2},of  connections  where  A j  £  {N,E,W,S},  1  <  &  2  and  the 
A,'s  are  disjoint.  For  example  A  =  { {N,S},  {E,W} }  results  in  connecting  the  North  and 
South  segments  together  and  the  East  and  West  segments  together.  If  this  is  done  in  each 
processor,  then  we  get,  simultaneously,  disjoint  row  and  column  buses  (Figure  1.13  and 
1.14).  If  A  =  {{N,S,E,W},<|>},  then  all  four  bus  segments  are  connected.  PARBUS  algo- 
rithms for  a  variety  of  applications  can  be  found  in  [MTLLP91ab,  WANG90ab,  LIN92, 
JANG92abcde,  OLAR92,  ELGI91,  LI91ab].  Observe  that  in  an  RMESH  the  realizable 
connections  are  of  the  form  A  =  {A  i },  A  j  £  {NJE,W,S }. 

The  polymorphic  torus  architecture  [LI89ab,  MARE89]  is  identical  to  the  PARBUS 
except  that  the  rows  and  columns  of  the  underlying  mesh  wrap  around  (Figure  1.15).  In  a 
reconfigurable  network  (RN)  [BENA91]  no  restriction  is  placed  on  the  bus  segments  that 
connect  pairs  of  processors  or  on  the  relative  placement  of  the  processors,  i.e.,  processors 
may  not  lie  at  grid  points  and  a  bus  segment  may  join  an  arbitrary  pair  of  processors. 
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Figure  1.9  4x4  RMESH 
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Figure  1.10  Row  buses 
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Figure  1.11    Column  buses 


Figure  1.12   4x4  PARBUS 
Like  the  PARBUS  and  polymorphic  torus,  each  processor  has  an  internal  switch  that  is 
able  to  connect  together  arbitrary  subsets  of  the  bus  segments  that  connect  to  the  proces- 


sor.  Ben-asher  et  al.   [BENA91]   also  define   a  mesh  restriction   (MRN)  of  their 
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Figure  1.13   Row  buses  in  a  PARBUS 


Figure  1.14    Column  buses  in  a  PARBUS 
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reconfigurable  network  .  In  this,  the  processor  and  bus  segment  arrangement  is  exactly  as 
for  the  PARBUS  (Figure  1.12).  However,  the  switches  internal  to  processors  are  able  to 
obtain  only  the  10  bus  configurations  given  in  Figure  1.16  .  Thus  an  MRN  is  a  restricted 
PARBUS. 

While  we  have  defined  the  above  reconfigurable  bus  architectures  as  square  two 
dimensional  meshes,  it  is  easy  to  see  how  these  may  be  extended  to  obtain  non  square 
architectures  and  architectures  with  more  dimensions  than  two. 


Figure  1.15    4x4  Polymorphic  torus 


Comparison  of  RMB  with  other  models 


Reconfigurable  Meshes  with  Buses  (RMB)  are  attractive  because  they  have  a  diam- 


eter of  1.  The  time  taken  to  broadcast  along  a  bus  is  assumed  constant  0(1)  regardless  of 
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Figure  1.16  Local  Configurations  allowed  for  a  switch  in  MRN 
the  length  of  the  bus.  This  assumption  may  be  unrealistic  on  currently  realizable  parallel 
architectures.  The  conductance  and  capacity  of  wires  and  transistors  imply  a  constant 
lower  bound  for  the  transmission  rate  of  a  single  switch.  Despite  this,  there  already  exist 
experimental  RMB  on  VLSI  chip  such  as  YUPPIE  (Yorktown  Ultra-Parallel 
Polymorphic  Image  Engine  [LI89ab]  and,  CAAP  [WEEM87,89].  RMB  are  simple  and 
easy  to  build  with  short  wires  (each  edge  has  0(1)  length)  in  VLSI.  The  time  to  change  a 
state  of  a  switch  is  orders  of  magnitude  higher  than  the  time  for  a  signal  to  traverse  a  dis- 
tance equal  to  the  diameter  of  a  switch  [ROTH88].  This  motivates  nonelectronic  or 
hybrid  implementation  of  RMBs.  Schuster  [SCHU91]  considers  optical  implementations. 

The  RMB  model  is  similar  to  an  SIMD  MCC  except  for  the  buses  and  switches.  It 
has  an  area  O(N)  for  N1'2  x  N1'2  RMB  under  the  assumption  that  processors,  switches, 
and  single  links  have  constant  size.  It  is  assumed  that  only  a  single  PE  can  broadcast  on  a 
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bus  at  any  time,  i.e.,  it  functions  in  CREW  mode.  The  switch  is  under  control  of  adjacent 
processors.  The  switch  setting  is  dynamically  controlled  (software)  by  adjacent  proces- 
sors as  long  as  there  is  no  conflict  in  the  desired  switch  settings  by  the  adjacent  proces- 
sors. The  switch  setting  allows  the  broadcast  bus  to  be  partitioned  into  different 
configurations  (subbuses).  Each  subbus  can  function  as  a  smaller  RMB.  RMB  can  simu- 
late any  VLSI  organization  with  equivalent  area  (those  which  can  be  laid  out  in  an 
Nm  x  N112  grid)  without  loss  of  time  [MILLP87b].  RMB  supports  several  techniques 
of  CRCW  PRAM  models  giving  the  same  or  better  time  performance  as  a  CREW  PRAM 
with  N  processors.  Schuster  [SCHU91]  provides  simulation  of  PRAM's  by  RMB's  and 
vice  versa.  Miller  et  al.  [MTLLP87b]  give  efficient  simulation  of  Mesh  of  Trees  (MOT) 
on  RMB  and  a  simulation  of  Pyramid  Computers  (PC)  without  loss  of  time.  The  compu- 
tation for  problems  requiring  high  interprocessor  communication  on  fixed-topology  net- 
work require  at  least  Q(diameter)  time.  RMB  can  solve  some  of  these  problems  in  con- 
stant time  in  certain  cases  [SCHU91],  [WANG90ab,  91ab]. 

RMB  is  more  powerful  and  superior  than  PRAM  (SMM).  One  dimensional  linear 
array  of  size  N  with  reconfigurable  bus  can  compute  the  OR  of  N  Boolean  inputs  in  0(1) 
steps,  whereas  the  CREW  PRAM  with  any  polynomial  number  of  processors  require 
Q(log  AO  steps  [COOK86].  The  parity  (summation  modulo  2)  of  N  inputs  is  computed  in 
0(1)  time  on  a  3  x  N  PARBUS  and  MRN  [LI91a].  It  takes  Q(log  Wloglog  N)  steps  for 
CRCW  PRAM  having  any  polynomial  number  of  processors  [BEAM87].  Moreover,  par- 
ity of  N2  inputs  can  be  computed  in  0(1)  time  on  an  N  x  N  PARBUS  and  MRN. 
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RMB  is  more  powerful  than  fixed-topology  networks  and  meshes  with  static  broad- 
cast buses.  Asymptotic  improvement  in  running  time  of  algorithms  that  solve  several 
problems  on  RMB  [MILLP87b,88abc,91ab],  [JENQ91abc],  [JANG92abcd],  [OLAR92], 
[REIS89]  have  been  reported  compared  to  efficient  algorithms  on  other  fixed-topology 
networks  such  as  mesh  of  trees  (MOT),  pyramid  computers  (PC),  hypercubes,  and 
meshes  with  static  broadcast  buses.  For  example,  parity  of  N  inputs  on  N1'2  x  N1/2 
mesh  with  multiple  buses  takes  Q(Nl/e)  [PRAS87a]  and  Q(Nm)  time  on  any  rectangu- 
lar N-processor  mesh  with  multiple  buses  [BARN91],  [CHEN89].  This  can  be  solved  in 
0(1)  time  on  a  PARBUS  and  MRN. 

Dissertation  Outline 

The  prospects  of  developing  O(l)  i.e.  constant  time  algorithms  on  RMBs  motivated 
this  dissertation.  Several  other  dissertations  have  been  motivated  by  similar  considera- 
tions [REIS89],  [SCHU91],  [WANG91a].  This  dissertation  covers  algorithms  for  sorting 
on  two-dimension  and  higher  dimension  RMBs  for  all  configurations  in  0(1)  time,  sort- 
ing N  elements  on  N  processor  meshes  using  a  number  of  routing  steps  approximately 
equal  to  the  distance  lower  bound  in  two-dimensions  and  its  simulation  on  RMBs,  con- 
stant time  algorithms  for  determining  the  convex  hull  of  a  set  of  N  planar  points,  finding 
the  smallest  enclosing  rectangle  of  a  set  of  N  planar  points,  triangulating  N  planar  points 
and  the  ECDF  search  problem  for  N  planar  points. 

In  chapter  2,  we  discuss  the  sorting  problem  on  two  and  higher  dimensions  on  all 
RMB  configurations.  Our  algorithms  require  a  fewer  number  of  routes  than  previuosly 
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published  algorithms  for  two-dimensional  RMBs.  Our  algorithms  are  much  simpler  and 
easier  to  implement.  We  show  that  it  is  faster  to  sort  than  to  select  the  kth  element  using 
Lin's  algorithm  [LIN92]. 

In  chapter  3,  sorting  N  elements  on  N  processor  meshes  and  RMBs  is  discussed.  We 
show  that  the  lower  bound  for  the  number  of  routes  needed  to  sort  on  an  n  x  n  bidirec- 
tional mesh  established  in  [SCHN86]  is  incorrect  We  give  sorting  algorithms  on  2- 
dimensional  bidirectional  and  unidirectional  meshes  using  a  number  of  routing  steps  that 
are  very  close  to  the  distance  lower  bound.  We  demonstrate  the  practical  implications  of 
our  algorithms  for  sorting  on  a  mesh.  We  also  show  how  to  simulate  the  mesh  algorithms 
on  reconfigurable  meshes  with  buses. 

In  chapter  4,  we  derive  0(1)  time  algorithms  for  determining  the  convex  hull  of  N 
planar  points,  finding  the  smallest  enclosing  rectangle  of  N  planar  points,  triangulating  N 
planar  points  and  the  ECDF  search  problem  for  N  planar  points  onJVxiV  reconfigurable 
meshes  with  buses.  We  prove  that  Reisis's  convex  hull  algorithm  [REIS92]  is  incorrect. 
Our  algorithms  work  on  all  RMB  architectures  (PARBUS/RMESH/MRN). 

Chapter  5  concludes  this  work  with  a  critical  analysis  of  the  results  and  a  discussion 
on  future  research  directions. 


CHAPTER  2 

SORTING  n  NUMBERS  ON  n  x  n  RECONFIGURABLE  MESHES 

WITH  BUSES 

Introduction 


In  this  chapter,  we  consider  the  problem  of  sorting  n  numbers  on  an  RMESH, 
PARBUS  and  MRN.  This  sorting  problem  has  been  previously  studied  for  all  three 
architectures.  We  can  sort  n  numbers  in  0(1)  on  a  three  dimensional  n  x  n  x  n  RMESH 
[JENQ91ab],  PARBUS  [WANG90],  and  MRN  [BENA91].  All  of  these  algorithms  are 
based  on  a  count  sort  [HORO90]  and  are  easily  modified  to  run  in  the  same  amount  of 
time  on  a  two  dimensional  n2  x  n  computer  of  the  same  model.  Nakano  et  al. 
[NAKA90]  have  shown  how  to  sort  n  numbers  in  O(l)  time  on  an  (n  log  n  x  n) 
PARBUS.  Jang  and  Prasanna  [JANG92a]  and  LIN  et  al.  [LIN92]  have  reduced  the 
number  of  processors  required  by  an  0(1)  sort  further.  They  both  present  0(1)  sorting 
algorithms  that  work  on  an  n  x  n  PARBUS.  Since  such  a  PARBUS  can  be  realized  using 
n2  area,  their  algorithms  achieve  the  area  time  squared  (AT2)  lower  bound  of  Q.  (n2)  for 
sorting  n  numbers  in  the  VLSI  word  model  [LEIG85].  The  algorithm  of  Jang  and  Pra- 
sanna [JANG92a]  is  based  on  Leighton's  column  sort  [LEIG85]  while  that  of  LIN  et  al. 
[LIN92]  is  based  on  selection.  Neither  is  directly  adaptable  to  run  on  an  n  x  n  RMESH 
in  O(l)  time  as  the  algorithm  of  [JANG92a]  requires  processors  be  able  to  connect  their 
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bus  segments  according  to  A  =  {  {N,S},  {E,W} }  while  the  algorithm  of  [LIN92]  requires 
A  =  {{N,S},  {E,W}}  and  {{N,W},  {S,  E}}.  These  are  not  permissible  in  an  RMESH. 
Their  algorithms  are,  however,  directly  usable  on  an  n  x  n  MRN  as  the  bus  connections 
used  are  permissible  connections  for  an  MRN.  Ben-Asher  et  al.  [BENA91]  describe  an 
O(l)  algorithm  to  sort  n  numbers  on  an  MRN  with  0(n1+e )  processors  for  any,  e  ,  e  > 
0.  This  algorithm  is  also  based  on  Leighton's  column  sort  [LEIG85]. 

In  this  chapter,  we  show  how  Leighton's  column  sort  algorithm  [LEIG85]  and  Mar- 
berg  and  Gafni's  rotate  sort  algorithm  [MARB88]  can  be  implemented  on  all  three 
reconfigurable  mesh  with  buses  (RMB)  architectures  so  as  to  sort  n  numbers  in  O(l)  time 
on  an  n  x  n  configuration.  The  resulting  RMB  sort  algorithms  are  conceptually  simpler 
than  the  O(l)  PARBUS  sorting  algorithms  of  [JANG92a]  and  [LIN92].  In  addition,  our 
implementations  use  fewer  bus  broadcasts  than  do  the  algorithms  of  [JANG92a]  and 
[LIN92].  Since  the  PARBUS  implementations  use  only  bus  connections  permissible  in 
an  MRN,  our  PARBUS  algorithms  may  be  directly  used  on  an  MRN.  For  an  RMESH, 
our  implementations  are  the  first  RMESH  algorithms  to  sort  n  numbers  in  O(l)  time  on 
sain  x  n configuration. 

In  section  2,  we  describe  Leighton's  column  sort  algorithm.  Its  implementation  on 
an  RMESH  is  developed  in  section  3.  In  section  4,  we  show  how  to  implement  column 
sort  on  a  PARBUS.  Rotate  sort  is  considered  in  sections  5  through  7.  In  section  5  we 
describe  Marberg  and  Gafni's  rotate  sort  [MARB88].  The  implementation  of  rotate  sort 
is  obtained  in  sections  6  and  7  for  RMESH  and  PARBUS  architectures,  respectively.  In 
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section  8,  we  propose  a  sorting  algorithm  that  is  a  combination  of  rotate  sort  and  Scher- 
son  et  al.'s  [SCHER89]  iterative  shear  sort.  In  section  9,  we  provide  a  comparison  of  the 
two  PARBUS  sorting  algorithms  developed  here  and  those  of  Jang  and  Prasanna 
[JANG92a]  and  Lin  et  al.  [LIN92].  For  the  PARBUS  model,  Leighton's  column  sort  uses 
the  fewest  bus  broadcasts.  However,  for  the  RMESH  model,  combined  sort  uses  the 
fewest  bus  broadcasts.  In  section  10,  we  make  the  observation  that  using  rotate  sort,  one 
can  sort  Nk  elements,  in  0(1)  time  onanN  x  N  x  •••  x  N  k  +  l  dimensional  RMESH 
and  PARBUS. 

Column  Sort 


Column  sort  is  a  generalization  of  Batcher's  odd-even  merge  [KNUT73]  and  was 
proposed  by  Leighton  [LEIG85].  It  may  be  used  to  sort  an  r  x  s  matrix  Q  where 
r  >  2(s-l)2  and  r  mod  s  =  0.  The  number  of  elements  in  Q  is  n  =  rs  and  the  sorted 
sequence  is  stored  in  column  major  order  (Figure  2.1).  Our  presentation  of  column  sort 
follows  that  of  [LEIG85]  very  closely. 

There  are  eight  steps  to  column  sort.  In  the  odd  steps  1,3,5,  and  7,  we  sort  each 
column  of  Q  top  to  bottom.  In  step  2,  the  elements  of  Q  are  picked  up  in  column  major 
order  and  placed  back  in  row  major  order  (Figure  2.2).  This  operation  is  called  a  tran- 
spose. Step  4  is  the  reverse  of  this  (i.e.,  elements  of  Q  are  picked  up  in  row  major  order 

and  put  back  in  column  major  order)  and  is  called  untranspose.  Step  6  is  a  shift  by  [  —  J  • 
This  increases  the  number  of  columns  by  1  and  is  shown  in  Figure  2.3.  Step  8,  unshift,  is 
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Figure  2.1  Sorting  a  9x3  matrix  Q  into  column  major  order 
the  reverse  of  this.  Leighton  [LEIG85]  has  shown  that  these  eight  steps  are  sufficient  to 
sort  Q  whenever  r  >  2(s  - 1  )2  and  r  mod  s  =  0. 
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Figure  2.2  Transpose  (step  2)  and  Untranspose  (step  4)  of  a  9x3  matrix 
Column  Sort  On  An  RMESH 

Our  adaptation  of  column  sort  to  an  n  x  n  RMESH  is  similar  to  the  adaptation  used 
by  Ben-Asher  et  al.  to  obtain  an  0(/i17/9)  processor  reconfigurable  network  that  can  sort 
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Figure  2.3  Shift  (step  6)  and  Unshift  (step  8) 
in  O(l)  time.  This  reconfigurable  network  uses  approximately  11  layers. 

The  input  n  numbers  to  be  sorted  reside  in  the  Z  variables  of  the  row  1  PEs  of  an 
n  x  n  RMESH.  This  is  interpreted  as  representing  the  column  major  order  of  the  Q 
matrix  used  in  a  column  sort  (Figure  2.4).  The  dimensions  of  Q  are  ri  x  s\  where  r\  = 
1/2*H3M  and  sx  =  2*/i1M.  Clearly,  there  is  an  n\  such  that  f\  >  2*(si-l)2  for  all 
n  £  n  i .  Hence,  column  sort  will  work  for  n  £  n  i .  For  n  <  n  i  we  can  sort  in  constant 
time  (as  /i  i  is  a  constant)  using  any  previously  known  RMESH  sorting  algorithm. 

The  steps  in  the  sorting  algorithm  are  given  in  Figure  2.5.  Steps  1,  2,  and  3  use  the 
n  x  r  i  sub  RMESH  A ,  to  sort  column  i  of  Q.  For  the  sort  of  step  4,  the  Q  matrix  has 
dimensions  rj  x  (st  +1).  Columns  1  and  ^i+  1  are  already  sorted  as  the  -<»'s  and  -n»'s 
of  Figure  2.3  do  not  affect  the  sorted  order.  Only  the  inner  S\  -  1  columns  need  to  be 
sorted.  Each  of  these  columns  is  sorted  using  an  it  X  T]  sub  RMESH,  Bt,  as  shown  in 
Figure  2.6.  The  sub  RMESHs  X  and  Y  are  idle  during  this  sort. 
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Figure  2.4  Column  major  interpretation  of  Q 


Step  0:  [Input]  Q  is  available  in  column  major  order,  one  element  per  PE,  in  row  1  of 
the  RMESH.  I.e.,  z[  1  J]  =  /th  element  of  Q  in  column  major  order. 

Step  1:  [Sort  Transpose]  Obtain  the  Q  matrix  following  step  2  of  column  sort.  This 
matrix  is  available  in  column  major  order  in  each  row  of  the  RMESH. 

Step  2:  [Sort  Untranspose]  Obtain  the  Q  matrix  following  step  4  of  column  sort.  This 
matrix  is  available  in  column  major  order  in  each  row  of  the  RMESH. 

Step  3:  [Sort  Shift]  Obtain  the  Q  matrix  following  step  6  of  column  sort.  This  matrix 
(excluding  the  -°°  and  -h»  values  )  is  available  in  column  major  order  in  each 
row  of  the  RMESH. 

Step  4:    [Sort  Unshift]  Obtain  the  sorted  result  in  row  1  of  the  RMESH. 


Figure  2.5    Sorting  Q  on  an  RMESH 
Thus  the  sorts  of  each  of  the  above  four  steps  involve  sorting  r  \  numbers  using  an 
n  x  r  i  sub  RMESH.  Each  of  these  is  done  using  column  sort  with  the  Q  matrix  now 
being  an  r2  x  j2  matrix  with  r^Si  =  r%.  We  use  r2  =  n1'2  and  $2  ■  1/2*«1/4.  With 
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1/2  ^  „l/2 


this  choice,  we  have  2($2-l)    <  2s2    =  l/2*n1/z  <  n1    ,  n  >  1.  We  use  W  to  denote 


the  /"2  x  52  matrix. 
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Figure  2.6   Sub  RMESHs  for  step  4  of  sort 


Let  us  examine  step  1  of  Figure  2.5.  To  sort  a  column  of  Q  we  initially  use  only  the 
nxr]  sub  RMESH  At  that  contains  this  column.  This  column  is  actually  the  column 
major  representation  of  a  matrix  W,.  So,  sorting  the  column  is  equivalent  to  sorting  Wt. 
To  sort  Wi,  we  follow  the  steps  given  in  Figure  2.5  except  that  Q  is  replaced  by  W,-.  The 
steps  of  Figure  2.5  are  done  differently  on  Wi  than  on  Q.  Figure  2.7  gives  the  necessary 
steps  for  each  W,-.  Note  that  this  figure  assumes  that  the  n  x  n  RMESH  has  been  parti- 
tioned into  the  s  i  A  ,'s  and  each  A  ,•  operates  independent  of  each  other.  To  SortTranspose 
a  Wj,  we  first  broadcast  Wi  which  is  initially  stored  in  the  Z  variables  of  the  row  1  PEs  of 


A  i  to  all  rows  of  A  ,•  (step  1).  Following  this,  the  U  variable  of  each  PE  in  each  column  of 
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Ai  is  the  same.  In  steps  2-4,  each  Ai  is  partitioned  into  square  sub  RMESHs  fly*, 
1  <,  j  <  r<i%  1  <  k  £  $2  of  size  r2  x  r2  (Figure  2.8).  Note  that  fly*  contains  column  /: 
of  Wi  in  the  £/  variables  of  each  of  its  rows.  B</*  will  be  used  to  determine  the  rank  of 
;'th  element  of  the  £'th  column  of  W,-.  Here,  by  rank  we  mean  the  number  of  elements  in 
the  &'th  column  of  W>  that  are  either  less  than  the/th  element  or  are  equal  to  it  but  not  in 
a  column  to  the  right  of  the  y'th  column  of  fly*.  So,  this  rank  gives  the  position,  in 
column  k,  of  they'th  element  following  a  sort  of  column  k.  To  determine  this  rank,  in  step 
2,  we  broadcast  the  y'th  element  of  column  k  of  Wi  to  all  processors  in  B  y*.  This  is  done 
by  broadcasting  U  values  from  column  j  of  B  yt  using  row  buses  that  are  local  to  B  y*. 
The  broadcast  value  is  stored  in  the  variables  of  the  PEs  of  B  y*.  Now,  the  processor  in 
position  (a,b)  of  Bijk  has  the  6'th  element  of  column  k  of  Wi  in  its  U  variable  and  the 
a'th  element  of  this  column  in  its  V  variable.  Step  3  sets  5  variables  in  the  processors  to 
0  or  1  such  that  the  sum  of  the  5"s  in  each  row  of  fl^*  gives  the  rank  of  they'th  element 
of  column  k  of  Wi.  In  step  4,  the  S  values  in  a  row  are  summed  to  obtain  the  rank.  We 
shall  describe  shortly  how  this  is  done.  The  computed  rank  is  stored  in  variable  R  of  the 
PE  in  position  [k,l]  of  fly*.  Note  that  since  k  ^  S2  ^  f"2>  1^.1]  is  actually  a  position  in 
Bijk- 

Now  if  our  objective  is  simply  to  sort  the  columns  of  Wi,  we  can  route  the  V  vari- 
able in  PE[£,i]  (  this  is  they'th  element  in  column  k  of  W{)  to  the  fl'th  column  using  a 
row  bus  local  to  fly*  and  then  broadcast  it  along  a  column  bus  to  all  PE's  on  column  R. 
However,  we  wish  to  transpose  the  resulting  W,-  which  is  sorted  by  columns.  For  this,  we 
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Step  1:     Broadcast  the  row  1  Z  values,  using  column  buses,  to  the  U  variables  of  all  PEs 
in  the  same  column  of  A,-. 

Step  2:     The  column  j  PEs  in  all  fi;y*'s  broadcast  their  U  values,  using  row  buses  local 
to  each  B  p,  to  the  V  variables  of  all  PEs  in  the  same  B  y^. 

Step  3:     PE  [a,b]  of  Bijk  sets  its  S  value  to  1  if  (U  <  V)  or  (U  =  V  and  b  <  J). 
Otherwise,  it  sets  its  5  value  to  0.  This  is  done,  in  parallel,  by  all  PEs  in  all 

Step  4:     The  sum  of  the  S's  in  any  one  row  of  Bug  is  computed  and  stored  in  the  R 
variable  of  PE[£,1]  of  By*.  This  is  done,  in  parallel,  for  all  £y*'s. 

Step  5:     Using  row  buses,  PE[Jfc,l]  of  each  fiy*  sends  its  V  value  to  the  PE  in  column 
[(R-l)  mod  s2]r2  +  (k-l)r2/s2  +  \  —  1  cMj. 

Step  6:     The  PEs  that  received  V  values  in  step  5  broadcast  this  value,  using  column 
buses,  to  the  U  variables  of  the  PEs  in  the  same  column  of  A , . 


Figure  2.7  Steps  to  Sort  Transpose  Wi  into  A  ,• 

see  that  they" th  element  of  W ',•  will  be  in  column  k  and  row  R  of  the  sorted  W,-.  Following 

p 

the  transpose,  this  element  will  be  in  row  (k-l)r2/s2  +  [  — 1  and  column  (R-l)  mod  s2  + 

*2 

1  of  Wi.  Hence,  the  V  value  in  PE[&,1]  of  Bijk  is  t0  De  i°  all  PEs  of  column  [(R-l)  mod 

■S2lr2  +(k-l)r2/s2  +f  — 1  of  Aj  following  the  SortTranspose.  This  is  accomplished  in 

*2 

steps  5  and  6.  Note  that  there  are  no  bus  conflicts  in  the  row  bus  broadcasts  of  step  5  as 
the  broadcast  for  each  B  ^  uses  a  different  row  bus  that  spans  the  width  of  A ,-.  The  total 
number  of  broadcasts  is  4  plus  the  number  needed  to  sum  the  S  values  in  a  row  of  But. 


33 


We  shall  shortly  see  that  summing  can  be  done  with  6  broadcasts.  Hence  SortTranspose 
uses  10  broadcasts. 
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Figure  2.8  Division  of  A,-  into  r%  x  r2  sub  RMESHs 


To  sum  the  S  values  in  a  row  of  an  r2  x  r^  RMESH,  we  use  a  slightly  modified 
version  of  the  ranking  algorithm  of  Jenq  and  Sahni  [JENQ91ab].  This  algorithm  ranks 
the  row  1  processors  of  an  r2  x  r2  RMESH.  The  ranking  is  done  by  using  0/1  valued 
variables,  5,  in  row  1.  The  rank  of  the  processor  in  column  i  of  row  1  is  the  number  of 
processors  in  row  1  and  columns  k,k<i  that  have  5=1.  Hence,  the  rank  of  processor 
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[1^2]  equals  the  sum  of  the  S  values  in  row  1  except  when  S[l,r2]  =  1.  In  this  latter 
case  we  need  to  add  1  to  the  rank  to  get  the  sum.  The  ranking  algorithm  of  [JENQ91ab] 
has  three  phases  associated  with  it. 


Phase  1:  Rank  the  processors  in  even  columns  of  row  1.  This  does  not  account  for  the 
l's  in  the  odd  column. 

Phase  2:  Phase  2:  Rank  the  processors  in  odd  columns  of  row  1.  This  does  not  account 
for  1  's  in  the  even  columns. 

Phase  3:   Combine  odd  and  even  ranks. 


The  procedures  for  phases  1  and  2  are  quite  similar.  These  are  easily  modified  to 
start  with  the  configuration  following  step  3  of  Figure  2.7  and  send  the  phase  1  and  phase 
2  results  of  the  rightmost  odd  and  even  columns  inSg*  to  PE[fc,l]  where  the  two  results 
are  added.  To  avoid  an  extra  broadcast,  the  result  for  the  rightmost  column  (phase  1  if  r2 
is  even  and  phase  2  if  r2  is  odd)  is  incremented  by  1  in  case  S[*,r2]  =  1  before  the 
broadcast.  While  the  phase  1  code  of  [JENQ91ab]  uses  4  broadcasts,  the  first  of  these  can 
be  eliminated  as  we  begin  with  a  configuration  in  which  S[*,b]  is  already  on  all  rows  of 
column  b,  1  <  b  <  r2.  So,  phases  1  and  2  use  6  broadcasts.  Phase  3  of  [JENQ91ab]  uses 
two  broadcasts.  Both  of  these  can  be  eliminated  by  having  their  phase  1  (2)  step  10 
directly  broadcast  the  rank  of  the  rightmost  even  (odd)  column  to  PE[£,1]  bus  using  a 
row  bus  that  spans  row  k  connected  to  a  column  bus  that  spans  the  rightmost  even  (odd) 
column  of  fl*».  So,  the  summing  operation  of  step  4  can  be  done  using  a  total  of  6 
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broadcasts. 

Analysis  of  RMESH  Column  Sort 

First,  let  us  consider  sorting  a  column  of  Q  which  is  an  r\  x  s\  matrix.  This 
requires  us  to  perform  steps  1-4  of  Figure  2.5  on  the  Wfs.  As  we  have  just  seen,  step  1 
uses  10  broadcasts.  Step  2  is  similar  to  step  1  except  that  we  begin  with  the  data  on  all 
rows  of  A  i  and  instead  of  a  transpose,  an  untranspose  is  to  performed.  This  means  that 
step  1  of  Figure  2.7  can  be  eliminated  and  the  formula  in  step  5  is  to  be  changed  to 
correspond  to  an  untranspose.  The  number  of  broadcasts  for  step  2  of  Figure  2.5  is  there- 
fore 9.  Steps  3  and  4  are  similar  and  each  uses  9  broadcasts.  The  total  number  of  broad- 
casts to  sort  a  column  of  Q  is  therefore  37. 

Now,  to  perform  a  SortTranspose  of  the  columns  of  Q  we  proceed  as  in  a  sort  of  the 
columns  of  Q  except  that  the  last  broadcast  of  SortUnshift  performs  the  transpose  and 
leaves  the  transposed  matrix  in  column  major  order  in  all  rows  of  the  RMESH.  This  takes 
37  broadcasts.  The  SortUntranspose  takes  36  broadcasts  as  it  begins  with  Q  in  all  rows. 
Similarly,  step  3  and  4  each  take  36  broadcasts.  The  total  number  of  broadcasts  is  there- 
fore 145. 

A  more  careful  analysis  reveals  that  the  number  of  broadcasts  needed  for  the  Sort- 
Shift  and  SortUnshift  steps  can  be  reduced  by  one  each  as  the  step  5  (Figure  2.7)  broad- 
cast can  be  eliminated.  Taking  this  into  account,  the  number  of  broadcasts  to  sort  a 
column  of  Q  becomes  35.  However  to  do  a  SortTranspose  of  the  columns  of  Q,  we  need 
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to  do  an  additional  broadcast  during  the  SortUnshift  of  the  W,'s.  This  brings  the  number 
to  36.  A  SortUntranspose  of  Q  takes  35  broadcasts  and  the  remaining  two  steps  of  Figure 
2.2  each  takes  34  broadcasts.  Hence,  the  total  number  of  broadcasts  becomes  139. 

Column  Sort  On  A  PARBUS 
The  RMESH  algorithm  of  Section  3  will  work  on  a  PARBUS  as  all  connections 
used  by  it  are  possible  in  a  PARBUS.  We  can,  however,  sort  using  fewer  than  139  broad- 
casts on  a  PARBUS.  If  we  replace  the  ranking  algorithm  of  [JENQ91ab]  that  we  used  in 
Section  3  by  the  prefix  sum  of  [LIN92]  (  i.e.,  prefix  sum  N  bits  on  a  (N+l)  x  N 
PARBUS)  then  we  can  sum  the  S's  in  a  row  using  two  broadcasts.  The  algorithm  of 
[LIN92]  needs  to  be  modified  slightly  to  allow  for  the  fact  that  we  begin  with  the  S 
values  on  all  columns  and  we  are  summing  r-i  S  values  on  a  r2  xr2  PARBUS  rather 
than  an  (r2  +  l)  x  r2  PARBUS.  These  modifications  are  straightforward.  Since  the  S's 
can  be  summed  in  2  broadcasts  rather  than  6,  the  SortTranspose  of  Figure  2.5  requires 
only  6  broadcasts.  The  SortUntranspose  can  be  done  in  5  broadcasts.  As  indicated  in  the 
analysis  of  Section  3,  the  SortShift  and  SortUnshift  steps  can  be  done  without  the  broad- 
cast of  Step  5  of  Figure  2.7.  So,  each  of  these  require  only  4  broadcasts.  Thus,  to  sort  a 
column  of  Q  requires  19  broadcasts.  Following  the  analysis  of  Section  3,  we  see  that  a 
SortTranspose  of  Q  can  be  done  with  20  broadcasts,  a  SortUntranspose  with  19  broad- 
casts, and  a  SortShift  and  SortUnshift  with  18  broadcasts  each.  Thus  n  numbers  can  be 
sorted onannx/i PARBUS  using 75  broadcasts. 
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We  can  actually  reduce  the  number  of  broadcasts  further  by  beginning  with  r\  = 
n213  and  S\  =  n1'3.  While  this  does  not  satisfy  the  requirement  that  r  £  2(s -l)2, 
Leighton  [LEIG85]  has  shown  that  column  sort  works  for  r  and  s  such  that  r  £  s(s- 1) 
provided  that  the  Untranspose  of  step  4  is  replaced  by  an  Undiagonalize  step  (Figure 
2.9).  The  use  of  the  undiagonalizing  permutation  only  requires  us  to  change  the  formula 
used  in  step  5  of  Figure  2.7  to  a  slightly  more  complex  one.  This  does  not  change  the 
number  of  broadcasts  in  the  case  of  the  RMESH  and  PARBUS  algorithms  previously  dis- 
cussed. However,  the  ability  to  use  r\  =  n2'3  and  Si  ■  n1'3  ( instead  of  r\  =2  n2'3  and 
s\  =  l/2*rt1/3  which  satisfy  r  £  2(5- 1)2  )  significantly  reduces  the  number  of  broad- 
casts for  the  PARBUS  algorithm  we  are  about  to  describe. 

Now,  the  SortTranspose,  SortUntranspose,  SortShift  and  SortUnshift  for  Q  are  not 
done  by  using  another  level  of  column  sort.  Instead  a  count  sort  similar  to  that  of  Figure 
2.7  is  directly  applied  to  the  columns  of  Q.  This  time,  A,-  is  an  n  x  r\  =  n  x  n2'3  sub 
PARBUS.  LetDi;  be  the/th  (from  the  top)  nm  x  rx  sub  PARBUS  of  At  (Figure  2.10). 
The  Dy's  are  used  to  do  the  counting  previously  done  by  the  #;;*'s.  To  count  r\  =n213 
bits  using  an  n1'3  x  n2'3  sub  PARBUS,  we  use  the  parallel  prefix  sum  algorithm  of 
[LIN92]  which  does  this  in  12  broadcasts  when  we  begin  with  bits  in  all  rows  of  Diy  and 
take  into  account  we  want  only  the  sum  and  not  the  prefix  sum.  Note  that  the  prefix  sum 
algorithm  of  [LIN92]  is  an  iterative  algorithm  that  uses  modulo  M  arithmetic  to  sum  iV 

bits  on  an  (M  + 1 )  x  N  PARBUS.  For  this  it  uses      ^      iterations.  For  the  case  of  sum- 


log  M 
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ming  N  bits,  this  is  easily  modified  to  run  on  an  M  x  N  PARBUS  in      *      iterations 

with  each  iteration  using  6  broadcasts  (  3  for  the  odd  bits,  and  3  for  the  even  bits).  With 
our  choice  of  r\  and  s\,  the  number  of  iterations  is  2,  while  with  r\  =2  n2'3  and  S\  = 
1/2  n 1/3,  the  number  of  iterations  is  3.  The  two  iterations,  together,  use  12  broadcasts. 
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Figure  2.9  Undiagonalize  a  9x3  matrix 
To  get  the  SortTranspose  algorithm  for  the  PARBUS,  we  replace  all  occurrences  of 
Bijk  by  Dij  ^d  of  PE[£,1]  by  PE[/,1]  in  Figure  2.7.  The  formula  of  step  5  is  changed  to 
[(/?-l)  mod  S\]  rx  +{i-\)r\ls\  +  \ R/S\~\.  The  number  of  broadcasts  used  by  the  new 
SortTranspose  algorithm  is  16.  The  remaining  three  steps  of  Figure  2.7  are  similar.  The 
SortUndiagonalize  takes  15  broadcasts  as  it  begins  with  data  in  all  rows  and  the  step  1 
(Figure  2.3)  broadcast  can  be  eliminated.  The  SortShift  and  SortUnshift  each  can  be  done 
in  14  broadcasts  as  the  step  5  (Figure  2.7)  broadcasts  are  unnecessary.  So,  the  number  of 


broadcasts  in  the  one  level  PARBUS  column  sort  algorithms  is  59. 
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Figure  2.10  Decomposing  A ,•  into  Dy's 


Rotate  Sort 


Rotate  sort  was  developed  by  Marberg  and  Gafni  [MARB88]  to  sort  M  N  numbers 
on  an  A/  x  W  mesh  with  the  standard  four  neighbor  connections.  To  state  their  algorithm 
we  need  to  restate  some  of  the  definitions  from  [MARB88].  Assume  that  M  =  2s  and  N  = 
22t  where  s  £  t.  An  M  x  N  mesh  can  be  tiled  in  a  natural  way  with  tiles  of  size 
M  x  N112.  This  tiling  partitions  the  mesh  into  vertical  slices  (Figure  2.11(a)).  Similarly 
an  M  x  N  mesh  can  be  tiled  with  N1'2  x  N  tiles  to  obtain  horizontal  slices  (Figure 
2.11(b)).  Tiling  by  N1'2  x  N1/2  tiles  results  in  a  partitioning  into  blocks  (Figure 
2.11(c)).  Marberg  and  Gafni  define  three  procedures  on  which  rotate  sort  is  based.  These 
are  given  in  Figure  2.12. 


Rotate  sort  is  comprised  of  the  six  steps  given  in  Figure  2.13.  Recall  that  a  vertical 
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Figure  2.11  Definitions  of  the  Slice  and  block 
slice  isanMxiV1'2  submesh;  a  horizontal  slice  is  an  N  x  N1/2  submesh;  and  a  block  is 
aiV1'2  x  Nm  submesh. 


Marberg  and  Gafni  [MARB88]  point  out  that  when  M  =  N,  step  1  of  rotate  sort 
may  be  replaced  by  the  steps. 

Step  1'   (a)  Sort  all  the  columns  downward; 
(b)  Sort  all  the  rows  to  the  right; 
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Procedure  balance  (v,w); 

{Operate  on  submeshes  of  size  v  x  w) 

sort  all  columns  of  the  submesh  downward; 

rotate  each  row  i  of  the  submesh,  i  mod  w  positions  right; 

sort  all  columns  of  the  submesh  downward; 


end; 


(a)  Balance  a  submesh 


Procedure  unblock; 

{Operate  on  entire  mesh} 

rotate  each  row  i  of  the  mesh  (i.N112 )  mod  N  positions  right; 

sort  all  columns  of  the  block  downwards; 

end; 

(b)  Unblock 

Procedure  shear, 

{Operate  on  entire  mesh} 

sort  all  even  numbered  rows  to  the  right  and  all  odd  numbered  rows  to  the  left; 

sort  all  columns  downward; 

end; 

(c)  Shear 

Figure  2.12  Procedures  from  [MARB88] 
This  simplifies  the  algorithm  when  it  is  implemented  on  a  mesh.  However,  it  does 
not  reduce  the  number  of  bus  broadcasts  needed  on  reconfigurable  meshes  with  buses. 
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Step  1:  balance  each  vertical  slice; 

Step  2:  unblock  the  mesh; 

Step  3:  balance  each  horizontal  slice  as  if  it  were  an  N  x  N1'2  mesh  lying  on  its  side; 

Step  4:  unblock  the  mesh; 

Step  5:  shear  three  times; 

Step  6:  sort  rows  to  the  right; 


Figure  2.13  Rotate  Sort  [MARB88] 
As  a  result  we  do  not  consider  this  variant  of  rotate  sort. 

Rotate  Sort  On  An  RMESH 

In  this  section,  we  adapt  the  rotate  sort  algorithm  of  Figure  2.13  to  sort  n  numbers 
on  an  n  x  n  RMESH.  Note  that  the  algorithm  of  Figure  2.13  sorts  MN  numbers  on  an 
M  x  N  mesh.  For  the  adaptation,  we  use  M  =  N  .  Hence,  n  =  N2  =  24'.  The  n  =  N2 
numbers  to  be  sorted  are  available,  initially,  in  the  Z  variable  of  the  row  1  PEs  of  the 
n  x  n  RMESH.  We  assume  a  row  major  mapping  from  the  N  x  N  mesh  to  row  1  of  the 
RMESH.  Figure  2.14  gives  the  steps  involved  in  sorting  the  columns  of  the  n  x  n  mesh 
downward.  Note  that  this  is  the  first  step  of  the  balance  operation  of  step  1  of  rotate  sort. 
The  basic  strategy  employed  in  Figure  2.14  is  to  use  each  «xJV  subRMESH  of  the 
n  x  n  RMESH  to  sort  one  column  of  the  N  x  N  mesh. 
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{ sort  columns  ofNxN  mesh } 

Step  1:  Use  column  buses  to  broadcast Z[  1  J]  to  all  rows  in  column  j,  1  <i<n.  Now, 
Z[i,j]  =Z[l,j],l<i<n,l<jZn. 

Step  2:  Use  row  buses  to  broadcast  Z[i,i]  to  the  R  variable  of  all  PEs  on  row  i  of  the 
RMESH.  Now, /?[/,./]  =  Z[i,i]  =  Z[l,i],  l<i£n,l  <j<n. 

Step  3:  In  the  q'th  n  x  N  subRMESH  all,  PEs[/,;']  such  that  i  mod  N  =  q  mod  N  and 
(j- 1 )  mod  N  +  I  =\ i/N]  broadcast  their R  values  along  the  column  buses,  1 
<>q  <N.  Note  that  each  such  PE[/J]  contains  the  { i/N~\  'th  element  of  column 
q  of  the  N  x  N  mesh.  This  value  is  broadcast  to  the  U  variables  of  the  PEs  in 
the  column.  Now,  each  row  of  the  q'th  n  x  N  subRMESH  contains  in  its  U 
variables  column  q  of  the  n  x  n  mesh. 

Step  4:  Now  assume  that  the  n  x  n  RMESH  is  tiled  by  N2  N  x  N  subRMESHs.  In  the 
[a,b]'th  such  subRMESH,  the  PEs  on  column  a  of  the  subRMESH  broadcast 
their  U  value  using  row  buses  local  to  the  subRMESH.  This  is  stored  in  the  V 
variable  of  each  PE. 

Step  5:  PE  [/,;]  of  the  [a,6]'th  subRMESH  sets  its  S  value  to  0  if  (U  <  V)  or 
(U  =  V  and  i  <  a). 

Step  6:  The  sum  of  the  S's  in  any  one  row  of  each  of  the  N  x  N  subRMESH's  is 
computed.  The  result  for  the  [a,6]'th  subRMESH  is  stored  in  the  T  variable  of 
PE[fc,l]. 

Step  7:  Using  the  row  buses  that  span  the  n  x  n  RMESH  the  PE  in  position  [b,l]  of 
each  N  x  N  subRMESH  [a, b]  sends  its  V  value  to  the  Z  variable  of  the  PE  in 
column  (b -l)N  +  7+1. 

Step  8:     The  received  Z  values  are  broadcast  along  column  buses. 


Figure  2.14  Sorting  the  columns  of&nNxN mesh 
For  this,  we  need  to  first  extract  the  columns  from  row  1  of  the  RMESH.  This  is 
done  in  steps  1-3  of  Figure  2.14.  Following  this,  each  row  of  the  <7'th  n  x  N  subRMESH 
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contains  the  <7'th  column  of  the  N  x  N  mesh.  Steps  4-6  implement  the  count  phase  of  a 
count  sort.  This  implementation  is  equivalent  to  that  used  in  [JENQ91b]  to  sort  m  ele- 
ments on  an  m  x  m  x  m  RMESH.  Steps  7  and  8  route  the  data  back  to  row  1  of  the 
RMESH  so  that  the  Z  values  in  row  1  (and  actually  in  all  rows)  correspond  to  the  row 
major  order  of  the  n  x  n  mesh  following  a  sort  of  its  columns.  The  total  number  of 
broadcasts  used  is  12  (  note  that  step  6  uses  6  broadcasts). 

The  row  rotation  required  by  procedure  balance  can  be  obtained  at  no  additional 
cost  by  changing  the  destination  column  computed  in  step  7  of  Figure  2.14  so  as  to 
account  for  the  rotation.  The  second  sort  of  the  columns  performed  in  procedure  balance 
can  be  done  with  9  additional  broadcasts.  For  this,  during  the  first  column  sort  of  pro- 
cedure balance,  the  step  7  broadcast  of  Figure  2.14  takes  into  account  both  the  row  rota- 
tion and  the  row  major  to  column  major  transformation  to  be  done  in  steps  1-3  of  Figure 
2.14  for  the  second  column  sort  of  procedure  balance.  So,  step  1  of  rotate  sort  (i.e., 
balancing  the  vertical  slices)  can  be  done  using  a  total  of  21  broadcasts. 

To  unblock  the  data,  we  need  to  rotate  each  row  and  then  sort  the  columns  down- 
ward. Once  again,  the  rotation  can  be  accomplished  at  no  additional  cost  by  modifying 
the  destination  column  function  used  in  step  7  of  the  second  column  sort  performed  dur- 
ing the  vertical  slice  balancing  of  step  1.  The  column  sort  needs  1 1  broadcasts. 

The  horizontal  slices  can  be  balanced  using  18  broadcasts  as  the  Z  data  are  already 
distributed  over  the  columns.  The  unblock  of  step  4  takes  as  many  broadcasts  as  the 
unblock  of  step  2  (i.e.  9). 
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The  shear  operation  requires  a  row  sort  followed  by  a  column  sort.  Row  sorts  are 
performed  using  the  same  strategy  as  used  for  a  column  sort  The  fact  that  all  elements  of 
a  row  are  in  adjacent  columns  of  the  RMESH  permits  us  to  eliminate  steps  1-3  of  Figure 
2.14.  So,  a  row  sort  takes  only  9  additional  broadcasts.  The  following  column  sort  uses  9 
broadcasts.  So,  each  application  of  shear  takes  18  broadcasts.  Since  we  need  to  shear 
three  times,  step  5  of  rotate  sort  uses  54  broadcasts.  Step  6  of  rotate  sort  is  a  row  sort. 
This  takes  9  broadcasts.  The  total  number  of  broadcasts  is  120.  This  is  19  fewer  than  the 
number  of  broadcasts  used  by  our  RMESH  implementation  of  column  sort. 

Rotate  Sort  On  A  PARBUS 


Our  implementation  of  rotate  sort  on  a  PARBUS  is  the  same  as  that  on  an  RMESH. 
Note,  however,  that  on  a  PARBUS  ranking  (step  6  of  Figure  2.14)  takes  only  3  broad- 
casts. Since  this  is  done  once  for  each/row  column  sort  and  since  a  total  of  13  such  sorts 
is  done,  39  fewer  broadcasts  are  needed  on  a  PARBUS.  Hence  our  PARBUS  implemen- 
tation of  rotate  sort  takes  81  broadcasts.  Recall  that  Leighton's  column  sort  could  be 
implemented  on  a  PARBUS  using  only  59  broadcasts. 
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A  Combined  Sort 

We  may  combine  the  first  three  steps  of  the  iterative  shear  sort  algorithm  of  Scher- 

son  et  al.  [SCHER89]  with  the  last  four  steps  of  rotate  sort  to  obtain  combined  sort  of 

Figure  2.15.  This  is  stated  for  an  N  x  N  mesh  using  nearest  neighbor  connections.  The 

number  of  elements  to  sorted  isN  . 

Step  1:  Sort  each  iV3M  x  N314  block; 

Step  2:  Shift  the  fth  row  by  (i*N314)  mod  N  to  the  right,  I  £  i  <  N\ 

Step  3:  Sort  the  columns  downward; 

step  4:  balance  each  horizontal  slice  as  if  it  were  an  N  x  N1'2  mesh  lying  on  its  side; 

Step  5:  unblock  the  mesh; 

Step  6:  shear  two  times; 

Step  7:  sort  rows  to  the  right; 

Figure  2.15  Combined  Sort 

Notice  that  step  4-7  of  Figure  2.15  differ  from  steps  3-6  of  Figure  2.13  only  in  that 
in  step  6  of  Figure  2.15  the  shear  sort  is  done  two  times.  The  correctness  of  Figure  23 
may  be  established  using  the  results  of  [SCHER89]  and  [MARB88]. 

To  implement  the  combined  sort  on  an  n  x  n  RMESH  ornxn  PARBUS  (n  =  N2 
elements  to  be  sorted),  we  note  that  the  column  sort  of  step  3  can  be  done  in  the  same 
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manner  as  the  column  sorts  of  rotate  sort  are  done.  The  shift  of  step  2  can  be  combined 
with  the  sort  of  step  1.  The  block  sort  of  step  1  is  done  using  submeshes  of  size 
n  x  n3M  =  JV2  x  N3'2  =  JV2  x  (  N2'4  x  N2'4).  On  a  PARBUS,  this  is  done  by  ranking 
in  n1/4  x  n2'4  as  in  [LIN92]  while  on  an  RMESH,  this  is  done  using  the  algorithm  to 
sort  a  column  of  Q  using  an  n  x  r\  submesh  (section  3).  We  omit  the  details  here.  The 
number  of  broadcasts  is  77  for  PARBUS  and  1 18  for  an  RMESH. 

Comparison  With  Other  O(l)  PARBUS  Sorts 

As  noted  earlier,  our  PARBUS  implementation  of  Leighton's  column  sort  uses  only 
59  broadcasts  whereas  our  PARBUS  implementation  of  rotate  sort  uses  81  broadcasts 
and  our  implementation  of  combined  sort  uses  77  broadcasts.  The  O(l)  PARBUS  sorting 
algorithm  of  Jang  and  Prasanna  [JANG92a]  is  also  based  on  column  sort.  However,  it  is 
far  more  complex  than  our  adaptation  and  uses  more  broadcasts  than  does  the  O(l) 
PARBUS  algorithm  of  Lin  et  al.  [LIN92].  So,  we  compare  our  algorithm  to  that  of 
[LIN92].  This  latter  algorithm  is  not  based  on  column  sort.  Rather,  it  is  based  on  a  multi- 
ple selection  algorithm  that  the  authors  develop.  This  multiple  selection  algorithm  is 
itself  a  simple  modification  of  a  selection  algorithm  for  the  PARBUS.  This  algorithm 
selects  the  £'th  largest  element  of  an  unordered  set  S  of  n  elements.  The  multiple  selec- 
tion algorithm  takes  as  input  an  increasing  sequence  q\  <  qi  <  —  <  qn113  with  1  <  qi  <  n 
and  reports  for  each  i,  the  <7,'th  largest  element  of  5.  By  selecting  qi  =  i*n2'3, 

1  f\ 

1  <  i  £  n       one  is  able  to  determine  partitioning  elements  such  that  the  set  of  n 
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numbers  to  be  sorted  can  be  partitioned  into  n1/3  buckets  each  having  n213  elements. 
Each  bucket  is  then  sorted  using  an  n  x  n2'3  sub  PARBUS.  Lin  et  al.  [LIN92]  were  only 
concerned  with  developing  a  constant  time  algorithm  to  sort  n  numbers  on  an  n  x  n 
PARBUS.  Consequently,  they  did  not  attempt  to  minimize  the  number  of  broadcasts 
needed  for  a  sort.  However,  we  analyzed  versions  of  their  algorithms  that  were  optimized 
by  us.  The  optimized  selection  algorithm  requires  84  broadcasts  and  the  optimized  sort 
algorithm  used  103  broadcasts.  Thus  our  PARBUS  implementation  of  Leighton's  column 
sort  uses  slightly  more  than  half  the  number  of  broadcasts  used  by  an  optimized  version 
of  LIN  et  al.'s  algorithm.  Furthermore,  even  if  one  were  interested  only  in  the  selection 
problem,  it  would  be  faster  to  sort  using  our  PARBUS  implementation  of  Leighton's 
column  sort  algorithm  and  then  select  the  fc'th  element  than  to  use  the  optimized  version 
of  the  PARBUS  selection  algorithm  of  [LIN92].  Our  algorithm  is  also  conceptually 
simpler  than  those  of  [JANG92a]  and  [LIN92].  Like  the  algorithms  of  [JANG92]  and 
[LIN91],  our  PARBUS  algorithms  may  be  run  directly  on  an  n  x  n  MRN.  The  number  of 
broadcasts  remains  unchanged. 

Sorting  On  Higher  Dimension  RMB 

Rotate  sort  works  by  sorting  and/or  shifting  rows  and  columns  of  an  N  x  N  array. 
This  algorithm  may  be  implemented  on  a  three  dimensional  N  x  N  x  N  reconfigurable 
mesh  with  buses  so  as  to  sort  N2  elements  in  0(1)  time.  In  other  words,  n  =  N  x  N  ele- 
ments are  being  sorted  in  0(1)  time  on  an  n112  x  n1'2  x  n1'2  RMB.  Assume  that  we 
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start  with  n  elements  on  the  base  of  an  N  x  N  x  N  RMB.  To  sort  row  (column)  i,  we 
broadcast  the  row  (column)  to  the  fth  layer  ofJVxiV  processors  and  sort  it  in  O(l)  time 
in  this  layer  using  the  algorithms  developed  in  this  chapter  to  sort  N  elements  on  an 
N  x  N  RMB.  The  total  number  of  such  sorts  required  by  rotate  sort  is  13  (i.e.,  0(1))  and 
the  required  shifts  may  be  combined  with  the  sorts. 

We  can  extend  this  to  obtain  an  algorithm  to  sort  Nk  numbers  on  a  k  + 1  dimen- 
sional RMB  with  Nk+1  processors  in  O(l)  time.  The  RMB  has  an  N  x  N  x  •  •  •  x  N 
configuration  and  the  Nk  numbers  are  initially  in  the  face  with  £  +  1  dimension  equal  to 
zero.  In  the  preceding  paragraph  we  showed  how  to  do  this  sort  when  k  =  2.  Suppose  we 
have  an  algorithm  to  sort  N1'1  numbers  on  an  /  dimensional  N  processor  RMB  in  O(l) 
time.  We  can  use  this  to  sort  Nl  numbers  on  an  Nl+1  processor  RMB  in  O(l)  time  by 
regarding  the  Nl  numbers  as  forming  an  Nl  ~ l  xiV  array.  In  the  terminology  of  Marberg 
and  Gafni  [MARB88],  we  have  anMxiV  array  with  M  =  N1'1  >JV.To  use  rotate  sort, 
we  need  to  be  able  to  sort  the  columns  of  this  array  (which  are  of  size  M);  sort  the  rows 
which  are  of  size  N);  and  perform  shifts/rotations  on  the  rows  or  subrows. 

To  do  the  column  sort  we  use  /  dimensional  RMBs.  The  mth  such  RMB  consists  of 
all  processors  with  index  of  the  type  [/i,z'2,...,»/-i,#M/+i]-  By  assumption,  this  RMB 
can  sort  its  Nl_1  numbers  in  0(1)  time.  To  sort  the  rows,  we  can  use  two  dimensional 
RMBs.  Each  such  RMB  consists  of  processors  that  differ  only  in  their  last  two  dimen- 
sions (i.e.,  [a,b,c,...,ii,it+1]).  This  sort  is  done  using  the  0(1)  sorting  algorithm  for  two 
dimensional  RMBs.  The  row  shifts  and  rotates  are  easily  done  in  0(1)  time  using  the  two 
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dimensional  RMBs  just  described  (actually  most  of  these  can  be  combined  with  one  of 
the  required  sorts). 

Conclusions 

We  have  developed  relatively  simple  algorithms  to  sort  n  numbers  on 
reconfigurable  n  x  n  meshes  with  buses.  For  the  case  of  the  RMESH,  our  algorithms  are 
the  first  to  sort  in  0(1)  time.  For  the  PARBUS,  our  algorithms  are  simpler  than  those  of 
[JANG92a]  and  [LIN92].  Our  PARBUS  column  sort  algorithm  is  the  fastest  of  our  algo- 
rithms for  the  PARBUS.  It  uses  fewer  broadcasts  than  does  the  optimized  versions  of  the 
selection  algorithm  of  [LIN92].  Our  PARBUS  algorithms  can  be  run  on  an  MRN  with 
no  modifications.  Since  n  x  n  reconfigurable  meshes  require  n2  area  for  their  layout.  Our 
algorithms  (as  well  as  those  of  [JANG92a]  and  [LIN92])  have  an  area  time  square  pro- 
duct AT2  of  n2  which  is  the  best  one  can  hope  for  in  view  of  the  lower  bound  result  AT2 
Z  n2  for  the  VLSI  word  model  [LEIG85]. 

Using  two  dimensional  meshes  with  buses,  we  are  able  to  sort  n  elements  in  O(l) 
time  using  n2  processors.  Using  higher  dimensional  RMB,  one  can  sort  n  numbers  in 
0(1)  time  using  fewer  processors.  In  general,  n  =  N  numbers  can  be  sorted  in  0(1) 
time  using  Nk+1  =  n1  +  1,k  processors  in  a  it +  1  dimensional  configuration.  While  the 
same  result  has  been  shown  for  a  PARBUS  [JANG92b],  our  algorithm  applies  to  an 
RMESH  also. 


CHAPTER  3 

SORTING  n2  NUMBERS  ONnxn  MESHES 

Introduction 


In  this  chapter,  we  are  concerned  with  sorting  n  data  elements  using  an  nxn  mesh 
connected  parallel  computer.  The  initial  and  final  configurations  have  one  data  element  in 
each  of  the  nxn  processors  (say  in  the  A  variable  of  each  processor).  In  the  final 
configuration  the  data  elements  are  sorted  in  snake-like  row-major  order.  This  problem 
has  been  extensively  studied  for  mesh  architectures  (see  for  e.g.  [THOMK77], 
[NASS79],  [KUMA83],  [LEIG85],  [SCHN86],  [SCHER89],  [MARB88]).  While  all  of 
these  studies  consider  SIMD  meshes,  they  differ  in  the  permissible  communication  pat- 
terns. Thompson  and  Kung  [THOMK77]  consider  a  strict  unidirectional  model  in  which 
all  processors  that  simultaneously  transfer  data  to  a  neighbor  processor  do  so  to  the  same 
neighbor.  That  is,  all  active  processors  transfer  data  to  their  north  neighbor,  or  all  to  their 
south  neighbor,  etc.  Using  this  model,  Thompson  and  kung  [THOMK77]  have  developed 
a  sorting  algorithm  with  complexity  6ntr  +  ntc  +  low  order  terms,  where  tr  is  the  time 
needed  to  transfer  one  data  element  from  a  processor  to  its  neighbor  and  tc  is  the  time 
needed  to  compare  two  data  elements  that  are  in  the  same  processor.  In  the  sequel,  we 
omit  the  low  order  terms. 
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In  the  bidirectional  model,  we  assume  there  are  two  links  between  every  pair  (P,Q) 
of  neighbors.  As  a  result,  P  can  send  a  data  element  to  Q  at  the  same  time  that  Q  sends 
one  to  P.  Using  this  model,  Schnorr  and  Shamir  [SCHN86]  have  developed  a  sorting 
algorithm  with  complexity  3ntr  +  2ntc.  In  the  same  paper,  Schnorr  and  Shamir  "prove" 
that  3/i  routing  steps  are  needed  by  every  sorting  algorithm  for  an  even  stronger  mesh 
model,  each  processor  can  read  the  entire  contents  of  the  memories  of  its  (up  to)  four 
neighbors  in  time  tr  and  the  internal  processor  computations  are  free.  For  this  stronger 
mesh  model,  they  show  that  every  sorting  algorithm  must  take  3ntr  time  by  first  showing 
that  input  data  changes  in  the  lower  right  part  of  the  mesh  cannot  affect  the  values  in  the 
top  left  comer  processor  until  time  equal  to  the  distance,  d,  between  the  top  left  corner 
processor  and  the  nearest  of  these  lower  right  processors.  Next,  they  argue  that  by  chang- 
ing the  lower  right  input  data,  they  can  change  the  final  position  of  the  data  in  the  top  left 
comer  processor  by  distance  »  — 1.  As  a  result  the  sort  must  take  at  least  (d  +  n-2)tr 
time. 

The  fallacy  is  that  in  the  first  d—\  steps,  we  may  have  made  several  copies  of  the 
input  data  and  it  may  no  longer  be  necessary  to  route  the  data  from  the  top  left  corner 
processor  to  the  final  destination  processor.  In  fact,  we  show  that  one  can  sort  in  2ntr 
time  using  the  stronger  mesh  model. 

Park  and  Balasubramanium  [PARK87.90]  have  considered  a  related  sorting  prob- 
lem for  the  strict  unidirectional  model.  In  this  the  n2  elements  to  be  sorted  are  initially 
stored  two  to  a  processor  on  an  n  x  nil  mesh.  The  final  sorted  configuration  also  has  two 
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elements  per  processor.  The  time  complexity  of  their  algorithm  is  Antr  +  ntc.  This 
represents  an  improvement  of  2ntr  over  Thompson  and  Kung's  algorithm  provided  the 
two  element  per  processor  input/output  configuration  is  what  we  desire.  If  we  desire  the 
one  element  per  processor  configuration  (for  example,  the  data  to  be  sorted  is  the  result 
of  a  computation  that  produces  this  configuration  and  the  sorted  data  is  to  be  used  for 
further  computation),  then  it  is  necessary  to  first  fold  the  data  to  get  the  two  element  per 
processor  configuration,  then  sort  using  the  algorithm  of  [PARK87,  90],  and  finally 
unfold  to  get  the  desired  one  element  per  processor  final  configuration.  The  folding  can 
be  done  in  n/2  tr  time  as  below  (see  also  Figure  3.1(a)). 

Fl :         The  left  n/A  columns  shift  their  data  n/A  columns  to  right. 

F2:         The  right  n/A  columns  shift  their  data  n/A  columns  to  the  left. 

The  unfolding  can  also  be  done  in  nil  tr  time  using  the  two  steps: 

Ul :        Unfold  the  nIA  columns  labeled  A  in  Figure  3. 1  (b) 

U2:        Unfold  the  nIA  columns  labeled  B  in  Figure  3. 1  (b) 

To  unfold  A,  we  use  the  pipelined  process  described  by  the  example  of  Figure  3.2. 
B  is  unfolded  in  a  similar  way.  The  total  time  for  the  sort  is  therefore  5ntr  +  ntc  which  is 
ntr  less  than  that  of  Thompson  et  al  [THOMK77].  The  improvement  is  slightly  more  if 
we  consider  the  (nonstrict)  unidirectional  model  in  which  there  is  a  single  link  between 
each  pair  of  neighbor  processors  and  data  can  be  transferred,  in  parallel,  along  all  links  ( 
however,  if  P  and  Q  are  neighbors,  when  P  sends  data  to  Q,  Q  cannot  send  to  P).  In  this 
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Figure  3.1  Folding  and  Unfolding 


steps  Fl  and  Fl  of  folding  can  be  done  in  parallel.  Similarly,  we  can  do  Ul  and  Ul  in 
parallel.  The  total  sort  time  now  becomes  4.5  ntr  +  ntc. 

In  section  2,  we  show  that  the  Schnorr/Shamir  algorithm  of  [SCHN86]  can  be 
modified  to  sort  on  unidirectional  meshes  using  the  same  number  of  routes  as  above.  The 
number  of  comparison  steps  is,  however,  larger.  The  modified  Schnorr/Shamir  algorithm 
of  section  2  runs  in  2.5 ntr  +  3ntc  time  on  an  n  x  n  bidirectional  mesh.  The  number  of 
routes  is  therefore  less  than  the  3n  lower  bound  established  in  [SCHN86].  The  algorithm 
of  section  2  folds  data  onto  an  n  x  nil  submesh.  By  folding  onto  smaller  submeshes, 
i.e.,  onto  n  x  nlk  submeshes  for  k  >  2,  the  number  of  routes  can  be  reduced  further.  In 
fact  for  bidirectional  and  strict  unidirectional  meshes  we  can  come  very  close  to  the  dis- 
tance lower  bound  of  In  -2  and  4n  -2,  respectively.  This  is  done  in  section  3.  In  section 
4,  we  discuss  the  practical  implications  of  the  folding-unfolding  approach  to  sorting.  In 
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section  5,  we  show  how  the  sort  algorithms  of  sections  2  and  3  may  be  simulated  on 
n  x  n  reconfigurable  bus  architectures. 
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Figure  3.2  Unfolding  one  row  of  A 
Modified  Schnorr&Shamir  Algorithm 

The  sorting  algorithm  of  Schnorr  and  Shamir  [SCHN86]  is  given  in  Figure  3.3. 
This  algorithm  uses  the  following  terms  and  assumes  that  n  =  24q  for  some  integer  q. 

1.  Block...  an  n3'4  x  n314  submesh  formed  by  a  natural  tiling  of  an  n  x  n  mesh  with 
„3/4  x  n3/4  tiles  (Figure  3.4(a)). 

2.  Vertical  Slice...  an  n  x  n3'4  submesh  formed  by  a  natural  tiling  of  an  n  x  n  mesh  with 
nx  n3'4  tiles  (Figure  3.4(b)). 

3.  Snake...  the  1  x  n2  vector  along  the  snake-like  order  in  the  n  x  n  mesh  (Figure 
3.4(c)). 

4.  Even-odd  transposition  sort  [KNUT73]...«  elements  are  sorted  in  n  steps.  In  the  odd 
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steps  each  element  in  an  odd  position  is  compared  with  the  element  in  the  next  even  posi- 
tion and  exchanged  if  larger.  In  the  even  steps,  each  element  in  an  even  position  is  com- 
pared with  the  element  in  the  next  odd  position  and  exchanged  if  greater. 

5.  n1/4-way  unshuffle  ...  Let  it  =  1/4  log2rt  =  <7/4.  Data  in  column  j  of  the  mesh  is 
moved  to  column  j'.  Lety'^-i...  jq  be  the  binary  representation  of  y.  Then  jk-i—jo 
jq-i—jk  is  the  binary  representation  of  j'.  The  unshuffle  distributes  the  n3/4  columns  of 
each  vertical  slice  to  the  n 1/4  vertical  slices  in  a  round  robin  manner. 

Step  1:  Sort  all  the  blocks  into  snake-like  row-major  order. 

Step  2:  Perform  a  n  1/4-way  unshuffle  along  all  the  rows  of  the  array. 

Step  3:  Sort  all  the  blocks  into  snake-like  row-major  order. 

Step  4:  Sort  all  the  columns  of  the  array  downwards. 

Step  5:  Sort  all  the  vertical  slices  into  snake-like  row-major  order. 

Step  6:  Sort  all  the  rows  of  the  array  into  alternating  left-to-right  and  right-to-left  order. 

Step  7:  Perform  2/z3/4  steps  of  even-odd  transposition  sort  along  the  snake. 

Figure  3.3  Sorting  algorithm  of  Schnorr  and  Shamir 
The  correctness  of  the  algorithm  is  established  in  [SCHN86].  As  pointed  out  in 
[SCHN86],  steps  1,3,5,7  take  0(«3M)  time  and  are  dominated  by  steps  2,4,6.  Steps  4  and 
6  are  done  using  n  steps  of  even-odd  transposition  sort  each  while  step  2  is  done  by 
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Figure  3.4  Definitions  of  Blocks  and  Slices 
simulating  even-odd  transposition  sort  on  the  fixed  input  permutation  7C-1  where  7t  is  the 
desired  unshuffle  permutation.  Steps  4  and  6  take  ntr  +  ntc  time  each  on  a  bidirectional 
mesh  and  2ntr  +  ntc  on  a  unidirectional  (whether  strict  or  not)  one.  Step  2  takes  ntr  time 
on  a  bidirectional  mesh  and  2ntr  on  a  unidirectional  one.  The  total  time  for  the  algorithm 
of  Schnorr  and  Shamir  is  therefore  3ntr  +  2ntc  on  a  bidirectional  mesh  and 
6ntr  +  2ntc  on  a  unidirectional  (strict  or  nonstrict)  mesh.  On  the  stronger  mesh  model 


the  time  is  3ntr  as  tc  is  zero. 


We  may  modify  the  algorithm  of  Schnorr  and  Shamir  so  that  it  works  on  an 
n  x  n/2  mesh  with  two  elements  per  processor.  In  this  modification,  we  essentially  simu- 


late the  algorithm  of  Figure  3.3  using  half  as  many  processors.  The  run  time  for  steps 
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1,3,5  and  7  increases  but  is  still  0(n3/4).  Step  6  takes  n/2  tr  +  ntc  on  a  bidirectional 
mesh  as  data  routing  is  needed  for  only  half  of  the  steps  of  even-odd  transposition  sort 
and  ntr  +  ntc  on  a  unidirectional  mesh.  Step  4  takes  2ntr  +  2ntc  on  a  bidirectional 
mesh  and  4ntr  +  2ntc  on  a  unidirectional  mesh  as  in  each  of  the  n  steps  of  even-odd 
transposition  sort,  both  the  elements  in  a  processor  need  to  be  routed  to  the  neighbor  pro- 
cessor. This  can  be  reduced  to  ntr  +  2ntc  and  2ntr  +  2ntc,  respectively,  by  regarding 
the  2n  elements  in  a  processor  column  as  a  single  column  and  sorting  this  into  row  major 
order  using  2n  steps  of  even-odd  transposition  sort  (Figure  3.5).  Now,  a  single  element 
from  each  processor  is  to  be  routed  on  each  of  n  steps  and  no  data  routing  is  needed  on 
the  remaining  n  steps. 
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Figure  3.5  Modified  Step  4 


To  establish  the  correctness  of  the  sorting  algorithm  with  step  6  changed  to  sort 
pairs  of  columns  as  though  they  were  a  single  column,  we  use  the  zero-one  principle 
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[KNUT73].  This  was  also  used  by  Schnorr  and  Shamir  to  establish  the  correctness  of 
their  unmodified  algorithm.  Here,  we  assume  that  the  input  data  consists  solely  of  zeroes 
and  ones.  Since  we  have  not  changed  steps  1-3  of  their  algorithm,  the  proofs  of  parts  1,2, 
and  3  of  their  Theorem  3  are  still  valid.  We  shall  show  that  part  4  remains  true  following 
the  execution  of  the  modified  step  4.  Since  the  proof  of  parts  5,6,  and  7  rely  only  part  4  of 
the  theorem,  these  are  valid  for  the  modified  algorithm.  Hence,  the  modified  algorithms 
is  correct. 

Define  a  horizontal  slice  to  be  an  n3/4  x  n  submesh  obtained  by  a  natural  tiling  of 
an  n  x  n  mesh  using  tiles  of  size  n3'4  x  n  (Figure  3.4(d)).  Following  step  3  of  the  sort- 
ing algorithm,  the  maximum  difference  d$  between  the  number  of  zeroes  in  any  two 
columns  of  a  horizontal  slice  is  at  most  2  [SCHN86].  We  need  to  show  (i)  that  following 
the  modified  step  4,  the  maximum  difference  d± '  between  the  number  of  zeroes  in  any 
two  columns  (a  column  is  an  n  x  1  vector  with  n  elements)  of  the  mesh  is  at  most  2n1/4, 
and  (ii)  the  maximum  difference  between  the  number  of  zeroes  in  any  two  vertical  slices 
is  atmost  n 1/2.  The  proof  of  (ii)  is  same  as  that  in  [SCHN86]  as  our  modification  of  step 
4  does  not  move  data  between  vertical  slices.  For  (i),  consider  any  two  columns  of  n  ele- 
ments each.  If  these  two  columns  reside  in  the  same  column  of  processors,  then  the  max- 
imum difference  between  the  number  of  zeroes  between  two  columns  is  1  as  the  two 
columns  have  been  sorted  into  row  major  order  regarding  them  as  a  single  2n  element 
column.  Suppose  the  two  element  columns  reside  in  two  different  processor  columns. 
These   two   processor  columns   contain   4   element   columns  AJi,CJD.   Let  a,b,c,d 
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respectively  be  the  number  of  zeroes  in  these  four  columns  prior  to  the  execution  of  the 
modified  step  4.  From  the  proof  of  Theorem  3,  part  4  [SCHN86],  we  know  that 
\x-y  |  <  2n114  where  jcje  {a,b,c,d}- 

Let  a',b',c',d'  be  the  number  of  zeroes  in  column  AJi,CJ)  following  the  modified 
step  4.  It  is  easy  to  see  that  a'  =  \(&+b)/2\ ,  b'  =  |_(a+b)/2| ,  c'  =  \(c+d)P\ ,  d'  =  L(c+d)/2j . 
We  need  to  show  that  |  x  -y  \  <.  2  n 1M  where  x,y  e  {a'Jb',c',d'}- 

Without  loss  of  generality  (wlog),  we  may  assume  that  |  b'-  c'\  =  max  {|  x  -y  \  | 
x,y  €  {a'Tb',c'4'\ }  (note  that  when  x,y  e  {a\  b'}  or  x,y  e  {c  V},  |  x-y  \  £  1 )  and  that 
b'  >  c'.  b'-c'  <  \(a+b)/2]  -  [(c+d)/2\.  Again,  wlog  we  may  assume  that  c  <  d. 
So,  b'-c' £\(a+b)/2]  -  c  .  Since,  \b-c\<2ny4  and  \a-c\<>2nllA, 
a  +  b£2c  +  4nv4.  So,b'-c'  <  c  +  2n1/4  -  c    =    2nm. 

Our  modified  Schnorr/Shamir  algorithm  for  n  x  n  meshes  is  stated  in  Figure  3.6.  It 
combines  the  folding  and  unfolding  steps  discussed  in  the  introduction  and  the 
Schnorr/Shamir  modified  algorithm  for  n  x  n/2  meshes  described  above. 

Theorem  1:  The  sorting  algorithm  of  Figure  3.6  is  correct. 

Proof:  Follows  directly  from  our  earlier  discussion  that  established  the  correctness  of 
steps  1  through  7  as  a  sorting  algorithm  for  an  n  x  n/2  mesh  with  two  elements  per  pro- 
cessor. □ 

For  the  complexity  analysis,  we  note  that  steps  1,3,5,  and  7  take  0(/t3/4)  time  and 
are  dominated  by  the  remaining  steps  which  take  0(h)  time  each.  Since  we  are  ignoring 
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Step  0:  Fold  the  leftmost  and  rightmost  n/A  columns  so  that  the  n2  elements  to  be  sorted 
are  in  the  middle  n  x  n/2  submesh  with  each  processor  in  this  submesh  having 
two  elements. 

Step  1:  Sort  all  n3'4  x  n314  blocks  of  data  into  snake-like  row -major  order.  Note  that 
each  block  of  data  is  in  an  n  3/4  x  (1/2  n  3/4 )  submesh. 

Step  2:  Perform  an  n1/4-way  unshuffle  along  each  row  of  n  elements.  Note  that  each 
row  of  n  elements  is  in  a  row  of  the  middle  n  x  (1/2  n)  submesh. 

« 
Step  3:  Repeat  step  1. 

Step  4:  Sort  the  2n  elements  in  each  column  of  the  middle  n  x  n/2  submesh  into  row- 
major  order. 

Step  5:  Sort  the  elements  in  each  n  x  (1/2  n314)  submesh  into  snake-like  row-major 
order. 

Step  6:  Sort  all  rows  of  the  middle  n  x  n/2  submesh  into  alternating  left-to-right  and 
right-to-left  order. 

Step  7:  Perform  2n3'4  steps  of  even-odd  transposition  sort  along  the  snake  of  the  middle 
n  x  n/2  submesh. 

Step  8:   Unfold  the  middle  n  x  n/2  submesh. 


Figure  3.6  Sorting  algorithm  forn  xn  mesh, 
low  order  terms,  we  need  be  concerned  only  with  steps  0,  2, 4,  6,  and  8.  As  noted  earlier, 
steps  2,  4  and  6,  respectively,  take  n/2  tr,  ntr  +  2ntc,  and  n/2  tr  +  ntc  on  a  bidirec- 
tional mesh.  Steps  0  and  8  each  take  n/A  tr  on  a  bidirectional  mesh.  So,  the  complexity 
of  the  sort  algorithm  on  a  bidirectional  mesh  is  2.5/1  tr  +  3n  tc.  Since  tc  is  zero  on  the 
stronger  mesh  model,  the  sort  time  for  this  model  is  2.5  n  tr  (note  that  the  algorithm  of 
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[SCHN86]  has  a  run  time  of  3/i  tr). 

On  a  (nonstrict)  unidirectional  mesh,  the  times  for  steps  0,2,4,6,  and  8  are,  respec- 
tively, nIA  tr,  n  tr,  2n  tr  +  In  tc,  n  tr  +  n  tc,  nIA  tr.  The  total  time  for  this  model  is 
therefore  4.5 n  tr  +  3/z  tc.  On  a  strict  unidirectional  mesh,  the  times  for  steps  0,2,4,6, 
and  8  are  respectively,  nil  tr,  n  tr,  In  tr  +  In  tc,  n  tr  +  n  tc,  nil  tr  and  the  total 
time  is  5/i  tr  +  3/z  tc. 

Further  Enhancements 


In  the  stronger  mesh  model  of  Schnorr  and  Shamir  a  processor  can  read  the  entire 
memory  of  all  its  neighbors  in  unit  time.  This  implies  that  the  routing  time  is  independent 
of  message  length.  Let  Tr  denote  the  time  needed  to  send  a  message  of  arbitrary  length  to 
a  neighbor  processor.  We  may  generalize  the  sorting  algorithm  of  Figure  3.6  to  the  case 
when  each  processor  has  k  elements.  Now,  to  sort  a  row  or  column  of  data,  we  use  neigh- 
borhood sort  [BAUD78]  which  is  a  generalization  of  even-odd  transposition  sort.  Sup- 
pose that  m  processors  have  k  elements  each.  The  mk  elements  are  sorted  in  m  steps.  In 
the  even  steps,  the  elements  in  each  even  processor  are  merged  with  those  in  the  next  odd 
processor.  The  even  processor  gets  the  smaller  k  and  the  odd  one  the  larger  k.  In  odd 
steps,  the  k  elements  in  each  odd  processor  are  merged  with  the  k  elements  in  the  next 
even  processor.  The  smaller  k  elements  remain  with  the  odd  processor  and  the  larger  k 
with  the  even  one.  Note  that  when  k  =  1  neighborhood  sort  is  identical  to  even-odd  tran- 
sposition sort. 
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Our  generalization  of  Figure  3.6  is  given  in  Figure  3.7.  We  require  that  k  be  a  power 
of  2.  As  in  the  case  of  the  Schnorr-Shamir  algorithm,  we  assume  n  =  24q. 


Step  0:  Fold  the  leftmost  and  the  rightmost  n(k-\)l(2k)  columns  into  middle  n/k 
columns  so  that  the  n2  elements  to  be  sorted  are  in  the  middle  n  x  n/k  submesh 
with  each  processor  in  this  submesh  having  k  elements.  Sort  the  k  elements  in 
each  processor  of  the  middle  n  x  n/k  submesh. 

Step  1:  Sort  all  n3'4  x  n3'4  blocks  of  data  into  snake-like  row-major  order.  Note  that 
each  block  of  data  is  in  an  n314  x  n314  /&  submesh. 

Step  2:  Perform  an  n1/4-way  unshuffle  along  each  row  of  n  elements.  Note  that  each 
row  of  n  elements  is  in  a  row  of  the  middle  n  x  n/k  submesh. 

Step  3:   Repeat  step  1. 

Step  4:  Sort  the  kn  elements  in  each  column  of  the  middle  n  x  n/k  submesh  into  row- 
major  order.  Use  neighborhood  sort. 

Step  5:  Sort  the  elements  in  each  n  x  (n3/4  Ik)  submesh  into  snake-tike  row-major 
order. 

Step  6:  Sort  all  rows  of  the  middle  n  x  n/k  submesh  into  alternating  left-to-right  and 
right-to-left  order. 

Step  7:  Perform  In314  steps  of  even-odd  transposition  sort  along  the  snake  of  the  middle 
n  x  n/k  submesh. 

Step  8:   Unfold  the  middle  n  x  n/k  submesh. 


Figure  3.7  Generalized  sorting  algorithm 

Theorem  2:  The  generalized  sorting  algorithm  is  correct. 
Proof:  Similar  to  that  of  Theorem  1.  □ 
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For  the  complexity  analysis,  we  may  again  ignore  the  odd  steps  as  these  run  in 
0(/z3/4)  time  for  any  fixed  k.  The  folding  and  and  unfolding  of  steps  0  and  8  each  take 
n(k  - 1  )l{2k)  Tr  on  bidirectional  and  nonstrict  unidirectional  meshes  and  n(k  - 1  )lk  Tr 
on  strict  unidirectional  meshes.  The  sorting  of  step  0  takes  0(£log£)  time.  Step  2  takes 
nlk  Tr  on  bidirectional  meshes  and  2n/k  Tr  on  unidirectional  ones.  Step  4  requires  n 
steps  of  neighborhood  sort.  Each  merge  of  two  sets  of  k  elements  takes  atmost  2  k  tc 
time  (actually  atmost  2k -\  comparisons  are  needed).  So,  the  step  4  time  is 
n  Tr  +  2kn  tc  for  bidirectional  meshes  and  2n  Tr  +  2kn  tc  for  unidirectional  meshes. 
The  time  for  step  6  is  nlk  Tr  +  2/z  tc  for  bidirectional  meshes  and  2nlk  Tr  +  2n  tc 
for  unidirectional  ones. 

The  total  sorting  time  (ignoring  the  time  to  sort  k  elements  in  step  0)  for  a  bidirec- 
tional mesh  is  therefore  (2/t  +  nlk)  Tr  +  2n(k  +  l)  tc.  For  the  model  of  Schnorr  and 
Shamir  [SCHN87],  tc  =  0  and  the  time  becomes  (2/i  +  nlk)  Tr.  For  large  values  of  k, 
this  approximates  to  2n  Tr.  Since  (2n-2)  Tr  is  the  distance  lower  bound  for  sorting  on 
Schnorr  and  Shamir's  model,  the  generalized  sorting  algorithm  of  Figure  3.7  is  near 
optimal  for  large  k.  The  sorting  times  on  non-strict  and  strict  unidirectional  meshes  are 
(3/i  +  3nlk)  Tr  +  2n(k  +  l)  tc  and  {An  +  2nlk)  Tr  +  2n(£  +  l)  tc.  Since  (4h-2) 
is  the  distance  lower  bound  for  the  strict  unidirectional  model,  our  algorithm  is  near 
optimal  for  large  k  for  this  model  too. 

The  s  -way  merge  sorting  algorithm  of  Thompson  and  Kung  [THOMK77]  may  be 
similarly  generalized  to  sort  n2  elements  stored  k  to  a  PE  in  an  n  x  (nlk)  mesh 
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configuration.  The  resulting  sort  has  a  complexity  that  is  almost  identical  to  that  of  Fig- 
ure 3.7.  However,  Figure  3.7  is  conceptually  much  simpler. 

Practical  Implications 

Suppose  we  are  to  sort  n2  elements,  that  are,  initially  distributed  one  to  a  PE  on  an 
n  x  n  bidirectional  mesh.  The  final  configuration  is  also  to  have  one  element  per  PE.  On 
realistic  computers,  the  time  to  transfer  small  packets  of  data  between  adjacent  proces- 
sors is  dominated  by  the  setup  time.  For  instance  the  time  to  transfer  N  bytes  of  data 
between  adjacent  processors  of  an  NCubel  hypercube  is  (446.7  +  2.4  N)ms.  Therefore, 
it  is  reasonable  to  assume  that  the  data  transfer  time  is  independent  of  packet  size  for 
small  k  and  small  element  size.  Furthermore,  tR  »  tc  on  most  commercial  computers. 
For  instance,  tR  ~  40  tc  on  the  NCubel.  Table  1  gives  the  value  of  our  complexity  func- 
tions for  the  different  bidirectional  sort  algorithms  and  different  k.  When  tR  ~  40  tc  we 
can  expect  to  get  best  performance  using  the  algorithm  of  section  3  with  k  =  4.  For 
Ir  ~  10  tc,  the  algorithm  of  section  2  is  the  best  of  the  three.  The  original 
Schnorr&Shamir  algorithm  will  perform  best  only  when  tR  <  2  tc. 

Simulation  on  RMBs 

The  sorting  algorithms  described  here  may  be  simulated  by  parallel  computers  in 
the  reconfigurable  mesh  with  buses  (RMB)  family.  We  consider  only  two  members  of  the 
RMB  family:  RMESH  and  PARBUS. 

In  an  RMESH  [MILL88abc],  we  have  a  bus  grid  with  annxit  arrangement  of  pro- 
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Section  2 

Section  3 

3n  tR  +  In  tc 
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t„  =  40  tc 

122n  tc 

103  «  rc 
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100.67/1  tc 

h  =  10  te 

yin  te 

28n  tc 

31.3/t  fc 

32.5n  t. 

34n  fc 

35.67/t  »t 

tg  =Stc 

\lntc 

15.5/1  i,. 

19.66n  /c 

21.25/1  t£ 

23/1  tc 

24.83n  tc 

h  =  2  tc 

8/i  tc 

8/i  tc 

12.67n  tc 

14.5/1  rc 

16.4n  /c 

18.33/1  rc 

Table  1  Comparison  of  bidirectional  sort  algorithms 
cessors  at  the  grid  points  (see  Figure  3.8  for  a  4x4  RMESH  ).  Each  grid  segment  has  a 
switch  on  it  which  enables  one  to  break  the  bus,  if  desired,  at  that  point.  When  all 
switches  are  closed,  all  n2  processors  are  connected  by  the  grid  bus.  The  switches  around 
a  processor  can  be  set  by  using  local  information.  If  all  processors  disconnect  the  switch 
on  their  north,  then  we  obtain  row  buses  (Figure  3.9).  Column  buses  are  obtained  by 
having  each  processor  disconnect  the  switch  on  its  east  (Figure  3.10).  In  the  exclusive 
write  model  two  processors  that  are  on  the  same  bus  cannot  simultaneously  write  to  that 
bus.  In  the  concurrent  write  model  several  processors  may  simultaneously  write  to  the 
same  bus.  Rules  are  provided  to  determine  which  of  the  several  writers  actually  succeeds 
(e.g.,  arbitrary,  maximum,  exclusive  or,  etc.).  Notice  that  in  the  RMESH  model  it  is  not 
possible  to  simultaneously  have  n  disjoint  row  buses  and  n  disjoint  column  buses  that, 
respectively,  span  the  width  and  height  of  the  RMESH.  It  is  assumed  that  processors  on 
the  same  bus  can  communicate  in  0(1)  time.  RMESH  algorithms  for  fundamental  data 
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movement  operations  and  image  processing  problems  can  be  found  in  [MILL88abc, 
MILL91ab,JENQ91abc]. 

Ann  x  n  PARBUS  (Figure  3.11)  [WANG90]  is  an  n  x  n  mesh  in  which  the  inter- 
processor  links  are  bus  segments  and  each  processor  has  the  ability  to  connect  together 
arbitrary  subsets  of  the  four  bus  segments  that  connect  to  it.  Bus  segments  that  get  so 
connected  behave  like  a  single  bus.  The  bus  segment  interconnections  at  a  processor  are 
done  by  an  internal  four  port  switch.  If  the  upto  four  bus  segments  at  a  processor  are 
labeled  N  (North),  E  (East),  W  (West),  and  S  (South),  then  this  switch  is  able  to  realize 
any  set,  A  =  [A  i ,  A  2 } ,  of  connections  where  A ,-  c  { N,E,W,S } ,  1  ^  /<  2  and  the  A  j  's  are 
disjoint.  For  example  A  =  {{N,S},  {E,W}}  results  in  connecting  the  North  and  South 
segments  together  and  the  East  and  West  segments  together.  If  this  is  done  in  each  pro- 
cessor, then  we  get,  simultaneously,  disjoint  row  and  column  buses  (Figure  3.12  and  13). 
If  A  =  {{N,S,E,W },<)>},  then  all  four  bus  segments  are  connected.  PARBUS  algorithms 
for  a  variety  of  applications  can  be  found  in  [MILL91a,  WANG90ab,  LIN92,  JANG92]. 
Observe  that  in  an  RMESH  the  realizable  connections  are  of  the  form  A  =  [A  1 },  A 1  c 
{N,E,W,S}. 

The  RMESH  requires  two  steps  to  simulate  a  unit  route  along  a  row  or  column  of  a 
unidirectional  mesh.  The  PARBUS  can  simulate  such  a  route  in  one  step  using  the  bus 
configurations  of  Figures  12  and  13.  Additionally,  the  folding  and  unfolding  into/from 
nlk  columns  can  be  done  in  (k  —  1 )  tr  time  by  having  each  group  of  k  adjacent  columns 
fold  into  the  leftmost  column  in  the  group.  Now,  the  nlk  columns  that  contain  the  data 
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Figure  3.8  4x4RMESH 
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Figure  3.9  Row  buses 
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Figure  3.10    Column  buses 


Figure  3.11    4x4  PARBUS 
are  not  adjacent.  However,  this  does  not  result  in  any  inefficiency  when  simulating  steps 
1-7  as  row  buses  to  connect  the  columns  are  easily  established.  With  this  modification  to 


the  algorithms  of  Figures  6  and  7,  the  sort  time  f or  an  n  x  n  RMESH  becomes 
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Figure  3.12    Row  buses  in  a  PARBUS 


Figure  3.13    Column  buses  in  a  PARBUS 
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(8/i  +  2)  tr  +  3n  tc,  (4/i  +  Sn/k  +  2k  -  2)  Tr  +  2n(k+l)  tc  and  for  an  n  x  n 
PARBUS  becomes  (4n  +  2)  tr  +  3/z  tc,  {In  +  4/i/it  +  2k  -  2)  Tr  +  2n(k  +  l)  tc. 

Conclusion 


We  have  shown  that  the  lower  bound  for  the  number  of  routes  needed  to  sort  on  an 
n  x  n  bidirectional  mesh  that  was  established  in  [SCHN86]  is  incorrect.  Furthermore,  we 
have  provided  algorithms  that  sort  using  fewer  routes  than  the  lower  bound  of 
[SCHN86].  In  fact,  the  algorithm  of  section  3  is  able  to  sort  using  a  number  of  routes  that 
is  very  close  to  the  distance  lower  bound  for  bidirectional  as  well  as  strict  unidirectional 
meshes. 

The  fold/sort/unfold  algorithms  developed  here  have  practical  implications  to  sort- 
ing on  a  mesh.  The  advantages  of  this  technique  when  applied  to  computers  with 
tR  »  2  tc  were  pointed  out  in  section  4.  We  also  showed  how  to  simulate  the  mesh 
algorithms  on  reconfigurable  meshes  with  buses. 


CHAPTER  4 

COMPUTATIONAL  GEOMETRY  ONJVxiV  RECONFIGURABLE  MESHES 

WITH  BUSES 

Introduction 


In  this  chapter,  we  consider  computational  geometry  problems  on  IVxJV 
reconfigurable  meshes  with  buses  and  give  O(l)  time  algorithms  for  determining  convex 
hull  of  a  set  ofN  planar  points,  the  smallest  enclosing  rectangle  of  N  planar  points,  ECDF 
search  problem  for  N  planar  points,  and  triangulating  N  planar  points. 

Computational  geometry  problems  have  been  widely  studied  on  various  parallel 
computer  architectures  such  as  Meshes,  PRAM,  Hypercubes,  Cube-Connected  Cycles, 
and  Pyramid  Computers  [AGGA88],  [CHOW80],  [JEONL90,  JEONC92], 
[MILLS84ab,88e,89],  [LEE84],  [MACK90ab],  [STOJ86],  [LUV85,  LU86],  [SHAM78], 
[WANGT87].  Most  computational  geometry  problems  have  a  lower  bound  equal  to  the 
diameter  of  the  underlying  interconnection  network.  RMB  algorithms  for  computational 
geometry  problems  has  been  explored  in  [MILLP87b,  JANG92e,  REIS92].  In 
[MTLLP87b]  RMESH  algorithms  for  several  geometric  problems  on  digitized  images 
were  developed.  These  include  closest  figure,  extreme  points  of  every  figure,  diameter, 
smallest  enclosing  box,  and  smallest  enclosing  circle. 
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An  0(1)  time  convex  hull  algorithm  for  a  set  of  N  planar  points  is  given  in 
[REIS92].  The  algorithm  uses  an  N  x  N  configuration  and  is  based  on  the  grid  convex 
hull  algorithm  of  [MILL88e].  Unfortunately,  the  algorithm  of  [REIS92]  is  flawed.  In  sec- 
tion 2,  we  give  a  corrected  constant  time  RMESH  convex  hull  algorithm.  This  may  also 
be  run  on  an  N  x  N  PARBUS  and  MRN.  Jang  and  Prasanna  [JANG92e]  develop  con- 
stant time  PARBUS  algorithms  for  the  all  pairs  nearest  neighbor  problem  on  a  set  of  N 
planar  points,  2-set  dominance  counting  for  N  planar  points,  and  the  3-dimensional  max- 
ima problem.  All  these  algorithms  are  forNxN  PARBUS  configuration.  Their  algo- 
rithm for  the  nearest  neighbor  problem  is  easily  run  on  an  N  x  N  RMESH  and  MRN  in 
constant  time.  Jang  and  Prasanna  [JANG92e]  state  that  their  2-set  dominance  counting 
algorithm  can  be  simulated  by  an  RMESH  using  the  switch  simulation  of  [JANG92d]. 
However  the  simulation  of  [JANG92e]  requires  16  RMESH  processors  for  each 
PARBUS  processor  simulated.  Hence  a  4N  x  4N  RMESH  is  needed  for  the  simulation. 
In  section  4,  we  consider  the  ECDF  search  problem  which  is  closely  related  to  the  2-set 
dominance  counting  problem  and  develop  a  constant  time  algorithm  that  requires  only  an 
NxNRMESH. 

The  third  geometry  problem  considered  in  [JANG92e]  is  the  3-dimensional  maxima 
problem.  In  this,  we  are  given  a  set  S  of  three  tuples  (points  in  three  dimensional  space) 
and  are  required  to  find  all  points  p  e  S  such  that  there  is  no  q  e  S  with  the  property 
that  each  coordinate  of  q  is  greater  than  the  corresponding  coordinate  of  p  (i.e.,  no  q  in  S 
dominates  p).  The  algorithm  suggested  in  [JANG92e]  is  a  modification  of  their  2-set 
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dominance  counting  algorithm  and  is  unnecessarily  complicated.  We  observe  that  the 
problem  is  trivially  solved  in  constant  time  using  the  algorithm  of  Figure  4.1.  The  input 
N  points  are  in  row  1  of  the  N  x  N  PARBUS/RMESH/MRN.  The  algorithm  of  Figure 
4.1  can  be  used  to  solve,  in  constant  time,  the  ^-dimensional  maxima  problem  for  any  k. 

Step  1:     Broadcast  the  N  points  on  row  1  to  all  rows  using  column  buses. 

Step  2:      {Use  the  /'th  row  to  determine  if  the  /'th  point  of  S  is  a  non  dominated  point} 

•  Processor  [/,/]  broadcasts  its  point  to  all  processors  in  its  row  using  a  row  bus, 
1  <  i  <  N 

•  Processor  [ij],  1  £  i,j  <  N  compares  the  coordinates  of  the  point  it  received 
in  step  1  to  those  of  the  points  it  received  in  step  2.  If  each  coordinate  of  the 
step  1  point  is  larger  than  the  corresponding  coordinate  of  its  step  2  point,  then 
PE[iJ]  sets  its  T  variable  to  0.  Otherwise,  T  is  set  to  1. 

•  Using  row  bus  splitting,  PE[/,1]  determines  if  there  is  a  0  in  row  i,  1  £  i  <  N. 
If  so,  it  sets  its  U  variable  to  0  (the  /'th  point  is  dominated).  Otherwise,  U  is  set 
to  1  (the  /'th  point  is  not  dominated). 

Step  3:      {Route  the  results  back  to  row  1 } 

Using  row  buses  PE[/,1]  sends  its  U  value  to  PE[/,/].  Using  column  buses, 
PE[/,/]  sends  the  U  value  just  received  to  PE[1,/],  1  <  /  £M 

Figure  4.1  Simple  Constant  time  algorithm  for  3-dimensional  maxima 

The  remaining  computational  geometry  problems  we  consider  in  this  chapter  are  the 
smallest  enclosing  rectangle  problem  (section  3)  and  the  problem  of  triangulating  a  set  of 
N  planar  points  (section  5).  While  these  problems  have  not  been  considered  before  for 
the  RMB  architecture,  parallel  algorithms  for  other  architectures  have  been  developed. 
For  example,  Miller  and  Stout  [MILL89]  show  how  to  find  the  smallest  enclosing  rectan- 
gle of  a  set  of  N  planar  points  in  0(Nl/2)  on  an  N  x  N  mesh  and  Wang  and  Tsin 
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[WANGT87]  develop  an  O(logA0  time  CREW  PRAM  algorithm  to  triangulate  a  set  of  N 
planar  points.  Miller  and  Stout  [MILL89]  also  develop  optimal  mesh  algorithms  for  the 
convex  hull  of  N  planar  points  and  for  several  other  computational  geometry  problems. 
Mesh  algorithms  for  some  other  computational  geometry  problem  are  considered  in 
[JEON90,  LU85.86]. 

Convex  Hull 


In  this  section,  we  consider  the  problem  of  determining  the  Convex  Hull  of  a  set  of 
N  planar  points  on  N  x  N  Reconfigurable  Meshes  with  Buses  (RMBs).  While  the 
RMESH  algorithms  we  develop  can  be  run  without  modification  on  PARBUSs  and 
MRNs,  improved  performance  results  from  the  use  of  algorithms  for  subtasks  such  as 
ranking  and  sorting  that  are  tailored  to  the  PARBUS/MRN  architecture. 

Let  5  be  a  set  of  N  distinct  points  in  a  plane.  Let  E(S)  be  the  set  of  extreme  points  in 
S  and  let  CH(S)  denote  an  ordering  of  E  such  that  CH(S)  defines  a  convex  polygon.  This 
convex  polygon  is  the  convex  hull  of  S.  The  extreme  points  problem  is  that  of  finding 
E(S)  while  in  the  convex  hull  problem,  we  are  to  find  the  ordered  set  CH(S).  Our  algo- 
rithms to  find  E(S)  and  CH(S)  make  use  of  the  following  known  results: 

Lemma  1:  [Theorem  3.8,  [PREP85],  pi 04]  The  line  segment  /  defined  by  two  points 
is  an  edge  of  the  convex  hull  iff  all  other  points  lie  on  /  or  to  one  side  of  it.  □ 

Lemma  2:  [[PREP85],  pi 05]  Let  p\  and  P2»  respectively,  be  the  lexicographically 
lowest  and  the  highest  points  of  S.  Let  CH(S)  be  (p\,  /j,  l%% ... ,  l^Pi,  "i.  — »  Uj).  The 


-76- 

ordering  is  a  counter  clockwise  ordering  of  the  extreme  points  and  is  such  that  the  inte- 
rior of  the  defined  convex  polygon  is  on  the  left  as  one  traverses  the  extreme  points  in 
this  order.  The  points  /i, ...,  lk,P2  have  the  property  that  the  polar  angle  made  by  each  of 
these  points,  the  positive  x-axis,  and  the  preceding  point  of  CH(S)  is  least.  The  points 
Mi,  M2,  ...,  My  have  the  property  that  the  polar  angle  made  by  each  point,  the  negative  x- 
axis,  and  the  preceding  point  of  CH{S)  is  least  (see  Figure  4.2).  D 


Figure  4.2  Convex  hull  of  a  set  of  points 

N2  Processor  Algorithm 

Before  developing  the  0(1)  N  x  N  processor  algorithm,  we  develop  a  simple  0(1) 
N2  x  N  processor  algorithm.  An  0(1)  N2  x  N  processor  algorithm  for  E(S)  or  CH(S)  is 
easily  obtained  by  using  either  Lemma  1  or  2.  We  elaborate  on  the  algorithm  based  on 
Lemma  1.  The  strategy  employed  by  this  algorithm  is  given  in  Figure  4.3.  It  assumes  that 
the  N  points  of  S  are  initially  in  the  top  row  of  the  A^2  xN  RMESH/PARBUS/MRN. 
The  fth  row,  I  <  i  <  N2  of  the  RMESH/PARBUS/MRN  is  used  to  determine  if  points; 
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and  k  where  i  =  (;-  \)N  +  k  of  5  form  an  edge  of  the  convex  hull.  In  case  they  do, 
then;  and  it  are  in  £(S)  and;  immediately  precedes  or  immediately  follows  k  in  CH(S). 

The  correctness  of  Figure  2  is  a  direct  consequence  of  Lemma  1  and  the  fact 
[PREP85,  plOO]  that  an  ordering  of  the  points  of  E  by  polar  angle  about  an  internal  point 
yields  the  convex  hull.  The  complexity  of  the  algorithm  is  readily  seen  to  be  0(1).  Note 
that  if  only  E(S)  is  to  be  computed  then  steps  9-11  may  be  omitted. 

Step  1 :     Use  column  buses  to  broadcast  the  points  of  S  from  row  1  to  all  other  rows. 

Step  2:  In  row  i  =  (j-l)N  +  k,  1  <  i  <  N2,  points  ;'  and  k  are  broadcast  to  all 
processors  using  a  row  bus. 

Step  3:  Using  the  points;  and  k  received  in  step  2,  each  processor  computes  a,  b,  and  c 
such  that  ax  +  by  +  c  =  0  defines  the  straight  line  through  points;  and  k. 

Step  4:  Let  (u,  v)  be  the  coordinates  of  the  point  of  S  assigned  to  processor  [e,f]  in  step 
1.  Processor  [e,  f\  sets  its  flag  to  1  if  au  +  bv  +  c  >  0,  -1  if 
au  +  bv  +  c  <  0,  0  if  au  +  bv  +  c  =  0  and  (m,v)  is  on  the  line  segment 
between  points;  and  k  of  step  3  (this  includes  the  cases  when  (u,v)  is  point;  or 
k),  2  otherwise. 

Step  5:  Using  row  buses  and  row  bus  splitting,  PE[/,1],  i  =  (j-l)N  +  k, 
1  <  i  <  N2,  determines  if  there  is  a  flag  =  2  on  row  i.  If  so,  (j,  k)  is  not  an 
edge  of  the  convex  hull.  If  not,  it  determines  if  there  is  a  1  and  later  if  there  is  a 
-1.  If  both  a  1  and  -1  are  present,  (/,£)  is  not  M  edge  of  the  convex  hull. 
Otherwise  it  is. 


Figure  4.3  0(1)  N2  x  N  processor  algorithm  for  £(S)  and  CH(S) 
(continued  on  next  page) 
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Step  6:  If  PE[i,l],  i  =  (j-l)N  +  k,  i  <  <  N2,j  >  k  detects  that  (jjc)  is  an  edge  of 
the  convex  hull,  then  using  a  row  bus  it  broadcasts  a  1  to  PEs  [ij]  and  [ijc]. 

Step  7:  PEs  that  receive  a  1  in  step  6  broadcast  this  to  the  PEs  in  row  1  of  the  same 
column  (note  0  or  4  PEs  in  each  column  receive  a  1  in  step  6;  using  column  bus 
splitting,  all  but  one  of  these  may  be  eliminated  to  avoid  concurrent  writes  to  a 
column  bus). 

Step  8:  Row  1  PEs  now  mark  the  points  of  S  they  contain  as  being  in  or  out  of  E(S) 
(note:  a  point  is  in  E(S)  if  and  only  if  it  receives  a  1  value  in  step  7). 

Step  9:  Using  row  bus  splitting  in  row  1,  any  three  extreme  points  are  accumulated  in 
PE[1,1].  In  case  |  E(S)  \  =  2,  the  remainder  of  this  step  is  omitted.  The 
centroid  of  these  three  points  is  computed.  Since  no  three  points  of  E  can  be 
collinear,  the  centroid  is  an  interior  point  of  the  convex  hull. 

Step  10:  The  centroid  computed  in  step  9  is  broadcast  to  all  points  of  E(S)  using  a  row 
bus.  Each  of  these  points  computes  the  polar  angle  made*  by  the  point  using 
the  centroid  as  the  origin. 

Step  11:  The  points  of  E(S)  are  sorted  by  polar  angle  using  the  0(1) 
RMESH/PARBUS/MRN  sort  algorithm  of  [NIGA92]. 


Figure  4.3  0(1)  N2  x  N  processor  algorithm  for  E(S)  and  CH(S) 
(continued  from  previous  page) 

N2  Processor  Algorithm 

The  convex  hull  of  5  can  be  computed  in  0(1)  time  onaniVxJV  RMESH  using  the 
algorithm  of  Figure  4.4.  The  algorithm  for  an  N  x  N  PARBUS/MRN  is  similar.  Step  RO 
can  be  done  using  the  0(1)  sorting  algorithm  for  N  data  items  on  an  N  x  N  RMESH 
[NIGA92].  Step  Rl  is  done  using  the  0(1)  /V3  processor  algorithm  of  the  preceding 


An  alternative  to  using  the  polar  angle  is  discussed  on  plOO  of  [PREP85]. 
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section  on  each  N  x  N1'2  sub  RMESH.  For  R2,  we  use  the  i'th  N  x  N1'2  sub  RMESH 
to  determine  which  points  of  £,  are  also  in  £.  This  is  done  using  the  algorithm  of  Figure 
4.5. 

RO:  Sort  S  by  ^-coordinate  and  within  x  by  y-coordinate. 

Rl:  Partition  S  into  N112  subsets,  S,-,  1  <  i  <,  Nl/2  using  the  ordering  just  obtained. 

Each  subset  is  of  size  Nl/2.  Determine  the  convex  hull  C//,  and  extreme  points 
£,•  of  Si  using  aniVx  N1'2  sub  RMESH  for  each  f,l  <  i  <  Nm. 

R2:  Combine  the  extreme  points  £,,  1  £  i I  <,  N1'2,  to  obtain  the  extreme  points  E 

of  S. 

R3:  Obtain  the  convex  hull  of  S  from  E. 

Figure  4.4  N  x  N  RMESH  algorithm  for  convex  hull 

Theorem  1:  The  extreme  points  algorithm  of  Figure  4.5  is  correct. 

Proof:  To  establish  the  correctness  of  Figure  4.5,  we  need  to  show  that  exactly  those 

Ni/2 

points  of  {j  Ei,  that  are  not  in  E(S)  are  eliminated  in  steps  T5  and  T6.  It  is  easy  to  see 

k  =  1 

that  no  point  eliminated  in  T5  or  T6  can  be  in  E  as  in  T5  the  eliminated  points  are  not  in 
the  convex  hull  of  £,•  u  E}  and  in  T6  they  are  not  in  the  convex  hull  of  u  £*.  To  see  that 
the  points  not  eliminated  are  all  in  E(S),  suppose  that  there  is  a  non  eliminated  point 
pe  Ei  that  is  not  in  E(S).  Let  Q  be  the  set  of  non  eliminated  points.  Clearly,  E(S)  = 
extreme  points  of  Q  and  p  is  in  the  interior  of  the  convex  hull  of  Q  (see  Figure  4.8), 
CH(Q).  Without  loss  of  generality,  assume  that  |  E(S)  \  >  2.  Since  p  is  in  the  interior  of 
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Tl:  Let  Rp  denote  the  (;,/)'th  Nm  x  N112  sub  RMESH  of  the  entire  RMESH. 

Move  Ei  and  Ej  into  Rji  such  that  each  row  of  Rji  has  E\  (one  point  to  a 
processor)  and  each  column  has  Ej  (one  point  to  a  processor).  This  can  be 
accomplished  by  broadcasting  row  1  along  column  buses  to  all  rows  and  then 
broadcasting  the  diagonal  (of  the  N  x  N  RMESH)  processor  data  along  row 
buses  to  all  columns. 

T2:  Each  PE  in  the  k'\h  column  of  Rji  computes  the  slope  of  the  line  that  joins  the 

Jt'th  point  of  Ei  and  the  point  of  Ej  it  contains.  These  slopes  form  a  tritonic 
sequence  (they  may  decrease,  increase,  decrease  or  may  increase,  decrease, 
increase)  as  the  points  of  Ej  are  in  convex  hull  order. 

T3:  The  minimum  and  maximum  of  the  slopes  in  each  column  of  Rji  are  found 

using  bus  splitting.  These  are  stored  in  row  1  of  Rji. 

T4:  The  maximum  of  the  at  most  N1'2  minimums  found  in  step  T3  is  determined 

by  forming  the  cross  product  of  the  minimums  in  the  N  processors  of  Rji.  The 
minimum  of  the  maximums  of  step  T3  is  similarly  found. 


Figure  4.5  Substeps  for  step  R2  (continued  on  next  page) 


T5:  The  minimum  and  maximum  of  step  T4  define  the  two  tangents  of  £,  and  Ej. 

These  are  used  to  eliminate  points  of  £,•  that  are  now  known  to  not  be  in  E  (see 
Figure  4.6). 

T6:  The  tangents  of  T5  define  up  to  4  tangent  points.  Over  all  Rji,  1  <  i  <  N1'2, 

there  are  3  N1'2  -2  non  eliminated  tangent  points  plus  non  eliminated  points 
of  Si.  The  extreme  points  of  this  set  of  points  is  computed.  Tangent  points  of 
Ei  that  are  not  in  the  set  of  extreme  points  just  computed  are  eliminated  (see 
Figure  4.7). 

T7:  The  points  of  u  Ei  not  eliminated  in  steps  T5  and  T6  define  E. 


Figure  4.5  Substeps  for  step  R2  (continued  from  previous  page) 
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Tangents  =  Broken  lines 

*  =  tangent  points  of  £,  and  Ej 

X  =  points  of  E,  eliminated 


Figure  4.6  Examples  of  extreme  points  eliminated  in  step  T5 


i  =  2 

*  =  tangent  points 

X   =  extreme  points  of  E2  eliminated  in  step  T5 

+   =  extreme  points  of  E2  eliminated  in  step  T6 


Figure  4.7  Example  of  extreme  point  elimination  in  step  T6 
CH(Q),  there  must  be  a  q  e  Q  such  that  q  4  5/  and  the  line  from  p  to  q  does  not  go 
through  the  interior  of  C//(5,)  and  does  not  touch  the  boundary  of  CH(St)  except  at  p. 


Without  loss  of  generality,  assume  that  q  e  Sj  and  q  is  to  the  right  of  p  or  vertically 
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abovep.  If  p  lies  between  the  two  tangents  from  q  to  Si,  thenp  lies  between  the  tangents 
of  5,-  and  Sj  and  so  is  eliminated  in  step  T5  (see  Figure  4.8).  Since p  e  Q,pis  not  elim- 
inated in  T5  and  hence  must  be  on  a  tangent  from  q  to  Si  (Figure  4.10).  Furthermore,  ifp 
is  not  a  tangent  point  with  respect  to  5,-  and  Sj,  then/?  lies  between  the  two  tangents  of  5, 
and  Sj  and  is  eliminated  in  T5.  So,  we  may  assume  p  is  one  of  the  3N1/2  —  2  points 
considered  in  step  T6. 


CH(Q) 


Figure  4.8  Example  for  correctness  proof 

First  suppose  that  p  is  a  point  of  the  leftmost  partition  S  \ .  Since  p  is  in  the  interior 
of  CH(Q),  part  of  the  boundary  of  CH(Q)  passes  above  p  and  part  below  p.  Let  a  be  the 
leftmost  point  in  E\,  b  the  first  point  in  CH(Q)  that  the  lower  boundary  reaches  outside 
of  Si,  and  c  the  first  point  in  CH(Q)  that  the  upper  boundary  reaches  outside  of  Si  (see 
Figure  4.11).  If  b  =  c,  then  p  is  not  on  the  tangent  from  c  to  CH(S\)  and  so  must  have 
been  eliminated  in  step  T5.  So,  b  #  c.  Let  d  and  e  ,  respectively,  be  the  points  in  S,  to 
which  c  and  b  are  connected  in  CH(S  i ).  Note  that  it  is  possible  that  d  =  a  or  e  =  a  (or 
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tangents  from  q  to  Sj 
tangents  between  5,  and   Sj 


Figure  4.9  Tangents 


Figure  4.10  p  is  on  tangent  from  q 
both).  If  c  e  Sj,  then  d  and  c  must  be  tangent  points  with  respect  to  S\  and  Sj  (as  other- 
wise, some  points  of  S\  and/or  Sy  lie  above  the  upper  boundary  of  CH(S).  Similarly,  e 
and  b  are  tangent  points  with  respect  to  5j  and  S*  where  b  e  S*.  It  is  clear  thatp  is  not 
an  extreme  point  of  [a,  b,  c,  d,  e,  p).  Since  {a,  b,  c,  d,  e,p)  is  a  subset  of  the  set  used  in 
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step  T6,  p  must  have  been  eliminated  in  this  step. 


Figure  4.11  p  e  Si 

Next,  suppose  that  p  is  a  point  of  S^in.  The  proof  for  this  case  is  similar  to  that  for 
p  €  S\.  Finally,  suppose  p  e  Sj,  i  4  {1,  N1'2}.  Let  a,  b,  c,  d  be  the  first  points  of  the 
portions  of  the  boundary  of  CH(Q)  that  are  above  and  below  p  that  are  not  in  5,-  (Figure 
4.12).  Note  that  it  is  possible  that  a  =  b  and  c  =  d.  Also,  note  that  a,  b,  c,  and  d  must 
exist  as  \Sj\  =  N1'2  for  each  j  (i.e.,  as  no  Sj  is  empty). 

Suppose  that  a  e  Sj.  There  must  be  a  tangent  point  with  respect  to  Sj  and  5,-  that  is 
in  the  mangle  defined  by  a,  a  and  a    (see  Figure  4.12).  This  is  so  as/?  is  a  tangent  point 
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with  respect  to  Sj  and  the  other  end  of  the  tangent  must  lie  in  the  said  triangle.  Let  this 
point  be  A.  Similarly,  there  are  tangent  points  B,  C,  D  in  the  corresponding  triangles 
associated  with  b,  c,  and  d.  From  the  triangles  in  which  A,  B,  C,  D  are  located,  we  see 
that p  is  contained  in  the  convex  hull  of  [A,  B,  C,D,p}  and  so  must  be  eliminated  in  step 
T6.  Note  that  when  a  =  b  and  c  =  d,  there  must  be  a  tangent  point,  e,  of  Et  that  is  on 
the  upper  and  lower  part  of  the  boundary  of  CH{S).  Now,  p  is  not  an  extreme  point  of  {A, 
B,  C,  D,e,p]  (we  need  point  e  as  it  is  possible  that  A  =  B,  and  C  =  D).  □ 


Figure 4.12/j  e  Shi4  {l,N112} 

One  may  verify  that  each  of  the  steps  Tl  through  T7  can  be  done  in  0(1)  time.  For 
step  T6,  a  modified  version  of  the  N3  processor  algorithm  of  the  previous  section  is  run 
in  each  N  x  N  sub  RMESH.  Since  the  number  of  edges  to  consider  is  less  than 
(3N1/2)2  =  9N,  the  algorithm  is  run  at  most  nine  times.  In  each  run  of  the  algorithm  an 
edge  needs  to  compare  with  3N1/2  -  2  points  using  N1'2  processors.  This  is  done  in 
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three  passes. 

For  step  R3,  the  points  of  E  may  be  sorted  by  polar  angle  about  an  interior  point  as 
in  steps  10  and  1 1  of  Figure  2. 

Smallest  Enclosing  Rectangle 
It  is  well  known  [FREE75]  that  the  smallest  enclosing  rectangle  of  a  set  of  N  planar 
points  has  at  least  one  side  that  is  an  extension  of  an  edge  of  the  convex  hull.  Hence,  the 
smallest  enclosing  rectangle  may  be  found  by  first  computing  the  convex  hull;  then  deter- 
mining for  each  convex  hull  edge,  the  smallest  enclosing  rectangle  that  has  one  side 
which  is  an  extension  of  this  edge;  and  finally  determining  the  smallest  of  these  rectan- 
gles. The  algorithm  is  given  in  Figure  4.13.  Its  correctness  is  immediate  and  one  readily 
sees  that  its  complexity  is  0(1)  on  an  N  x  N  RMESH,  PARBUS  and  MRN. 

ECDF  Search  On  An  N  x  W RMESH 


Let  S  be  the  set  of  N  distinct  planar  points.  Point/?  e  S  dominates  point  q  e  5  if  and 
only  if  each  coordinate  of  p  is  greater  than  the  corresponding  coordinate  of  q.  In  the 
ECDF  search  problem  [STOJ86],  we  are  to  compute  for  each  pe  S,  the  number,  D(p,S), 
of  points  of  5  that  are  dominated  by  p.  Stojmenovic  [STOJ86]  provides  an  iV  (iV  =  |  S  |) 
processor,  0(log2  AO  time  hypercube  algorithm  for  this  problem. 

On  an  N  x  N  RMB,  D(p,S)  for  each  point  p  e  S,  may  be  computed  using  a  stra- 
tegy similar  to  that  used  in  [JANG92e]  to  solve  the  2-set  dominance  counting  problem  on 
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Step  1 :     Find  the  convex  hull  of  the  N  points. 


Step  2:     [Construct  convex  hull  edges] 

(a)  The  convex  hull  points  are  broadcast  to  all  rows  using  column  buses.  These 
are  saved  in  variable  R  of  each  processor. 

(b)PE[/,j]  broadcasts  its  R  value  using  a  row  bus  to  the  S  variable  of  all 
processors  on  row  i,  1  <,  i  <  p,p  =  number  of  convex  hull  points. 

(c)  PE[/,/  + 1]  broadcasts  its  R  value  using  a  row  bus  to  the  T  variable  of  all 
processors  on  row  /,  1  <  i  <  p  -  1.  For  i  =  p,  PE[p,l]  does  this.  {Note:  Now 
each  PE  in  row  i  contains  the  same  edge  of  the  convex  hull  in  its  S  and  T 
variables) 

Step  3:      [Determine  area  of  minimum  rectangle  for  each  edge] 

(a)  Using  its  R,  S,  and  T  values,  each  PE  computes  the  perpendicular  distance  D 
between  point  R  and  the  straight  line  defined  by  the  points  S  and  T.  Since,  the 
R's  are  in  convex  hull  order,  the  D  values  in  a  row  form  a  tritonic  sequence 
whose  minimum,  h,  can  be  found  in  0(1)  time  using  row  bus  splitting,  h  is  the 
minimum  height  of  the  rectangle  for  the  row  edge. 


Figure  4.13  Minimum  enclosing  rectangle  (continued  on  next  page) 
a  PARBUS  in  0(1)  time.  A  high  level  description  of  the  strategy  is  given  in  Figure  4.14. 
For  simplicity  in  presentation,  we  assume  that  no  two  points  of  S  have  the  same  x-  or  the 
same  y-coordinates. 

Step  1  can  be  done  by  sorting  the  N  points  in  0(1)  time  by  ^-coordinate  [NIGA92]. 
The  point  in  PE[1,  j]  belongs  to  partition  Xu  where  u  =  [      ,,-,  J  +  1.  Step  2  is  simi- 

larly  accomplished  by  another  sort  (this  time  by  y-coordinate).  Following  this  each  point 
knows  which  X  and  which  Y  partition  it  belongs  to.  The  computation  of  DX(p)  is  done 
using  aniVx  N112  submesh  for  the  points  in  each  X  partition.  The  /'th  such  submesh  is 


-88- 


Step  3:  (b)  Let  P  be  the  perpendicular  through  the  middle  of  the  edge  defined  by  points 
S  and  T.  Each  processor  computes  the  perpendicular  distance  of  its  point  R 
from  the  infinite  line  P  (use  negative  distances  for  points  on  one  side  of  P  and 
positive  distances  for  points  on  the  other  side).  These  distances  form  a 
quadratonic  sequence  and  its  maximum,  dmax,  and  minimum,  rfmin.  can  be 
found  in  0(1)  time  using  row  bus  splitting. 

(c)  The  minimum  area,  A,  of  the  rectangle  for  row  i  is  h*(dmax  -  dmjn).  Let 
this  be  stored  in  the  A  value  of  PE[/,1]. 

Step  4      [Determine  overall  minimum  rectangle] 

Compute  the  minimum  of  the  A's  of  PEs  [/,1],  1  <  i  <  p.  This  is  done  by 
forming  the  cross  product  of  the  A's  in  a  p  x  p  sub  array  and  comparing  the 
two  A's  in  each  PE. 


Figure  4.13  Minimum  enclosing  rectangle  (continued  from  previous  page) 


Step  1:  Partition  5  into  Nm  sets  Xh  1  <  i  <  N112  such  that  |Xj  |  =  Nm, 
1  <  i  <  N1'2,  and  no  point  in  X;  has  a  larger  x-coordinate  than  any  of  the 
points  in  X{  +  1,1  <  i  <  N1'2. 

Step  2:  Partition  S  into  W1/2  sets  YJt  1  <  ;  <.  Nm  such  that  |  Yj  \  =  N1'2, 
1  <  j  <  N1'2,  and  no  point  in  Yj  has  a  larger  y-coordinate  than  any  of  the 
points  in  Yj+l,\  <,  j  <  N112. 

Step  3:  Let  5/;  =  X,-  p|  Yj  (see  Figure  4.15).  For  each  p  e  Sy,  it  is  the  case  that 
D(p,S)  =  #  of  points  dominated  by  p  in  (Yj  -  5y)  +  #  of  points  dominated 
bypinXi  +    £    E  |5UV  |=  DY(p)  +  DX(p)  +    £    £  \Sm\< 

u  <i  v <j  u<i  v <j 

Compute  D(p, S)  using  this  equation. 


Figure  4.14  Strategy  to  compute  D(p, S) 
used  for  X,-.  For  this,  each  p  e  X,-  uses  an  N1'2  x  N112  partition  of  the  N  x  N112  sub- 
mesh.  In  processor  k  of  row  1  of  this  N112  x  N112  partition,  a  variable  T  is  set  to  1  iff/? 
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Figure  4.15  Partitioning  of  the  points  of  5 
dominates  the  Jk'th  point  of  X,-.  T  is  set  to  zero  otherwise.  The  T's  in  each  N1'2  x  Nv2 
partition  may  be  summed  in  0(1)  time  using  the  RMESH  ranking  algorithm  of 
[JENQ91a].  The  result  is  DX(p).  The  computation  of  DX(p)  is  best  done  after  Step  1  of 
Figure  4.14  as  at  this  time  the  points  of  S  are  in  the  needed  order.  DY(p)  may  be  com- 
puted in  a  similar  way  following  Step  2.  However  when  setting  the  value  of  T,  we  note 
that  T  is  to  be  set  to  1  iff  p  dominates  both  the  k'th  point  of  Yj  and  the  £'th  point  is  not  in 
the  same  X  partition  as  p. 

Let  My  =   £    Yi  I  suv  |  •  To  compute  Miy,  we  first  sort  the  points  of  5  by  their  Y 

u  <i  v  <j 

partition  index  and  within  each  Y  partition,  they  are  sorted  by  the  X  partition  index.  Fol- 
lowing this,  each  processor  in  row  1  determines  if  the  X  partition  number  of  the  point  in 
the  processor  to  its  left  is  less  than  that  of  its  own  point.  If  so,  then  using  its  own  column 
index  and  the  Y  partition  number  of  its  point,  it  can  compute  Ry  =   J)  $uj>  where  /  and 


u<< 


j  are,  respectively,  the  X  and  Y  partition  numbers  of  its  point  (/?/;-  =  q  -  (j-l)N 


1/2 
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where  q  is  the  processor's  column  index).  Finally,  Mi}  =    £  fl,v.  Before  describing 

how  to  do  this  final  summation,  we  provide,  in  Figure  4.16,  the  steps  in  the  overall  ECDF 
algorithm. 


Step  1:     Each  point  determines  its  X  partition.  This  is  done  by  sorting  the  points  by  x- 
coordinate. 

Step  2:     Compute  DX(p)  for  each  point.  This  is  done  using  an  N112  x  N1'2  block  for 
each/?  e  S. 

Step  3:     Each  point  determines  its  Y  partition.  For  this,  sort  all  points  by  y-coordinate. 
Step  4:     Compute  DY(p)  as  in  step  2. 

Step  5:     Sort  S  by  Y  partition  and  within  Y  partition  by  X  partition.  Each  p  e  S 
determines  its  Ry  value  where  i  and;  are  such  that/7  e  Xt  and/;  e  Yj. 

Step  6:     Each  point/?  determines  My  =    £  Rtv 

V  <j 

Step  7:     Each  point p  computes  D(p,S)  =  DY(p)  +  DX(p)  +  M/7(/7). 

Figure  4.16  RMESH  algorithm  to  compute  D(p, S) 
Now,  let  us  consider  the  computation  of  My  =    £  ^fv  F°r  eacn  '>  me  "       *■# 

v<j 

values  (1  <  j  <  N1'2)  are  computed  using  the  Z'th  N  x  N1'2  submesh.  Initially,  we  have 
Riv  stored  in  the  R  variable  of  the  PEs  in  the  v'th  N1'2  x  N1'2  submesh  of  the  fth 
N  x  N1'2  submesh  (Figure  4.17(a)).  The  steps  required  to  compute  Mjj  are  given  in  Fig- 
ure  4.18.  Its  correctness  is  easily  verified  and  its  complexity  is  O(l). 
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(a)   /'th  N  xNv2  submesh 
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(b)  A^3'4  x  N1'2  submesh 


Figure  4.17 


Step  1:     [Decompose  fl's  using  the  radix  N1/4] 

PE[1,  1]  of  each  of  the  N1'2  x  N1'2  submeshes  computes  a  and  b  such  that 
R  =  a*N114  +  b  and  0  <  b  <  W1M.  Since  |  Fv  |  ■  A^1/2,  each  R*  <  Nm. 
Hence,  0  <,  a  <N1'4. 

Step  2:     [Prefix  sum  the  a's] 

(a)  The  fth  N  x  Nm  submesh  is  partitioned  into  N1'4  NV4  x  Nm 
submeshes.  In  each  of  these,  the  Nl/4  a  values  contained  are  routed  to  the  row 
1  processors  such  that  the  k'th  group  ofN114  such  processors  contain  the  k'th  a 
value  (Figure  4.7(b)). 

(b)  In  each  N314  x  N112  submesh,  the  unary  representation  of  the  a's  is 
computed.  For  this,  the  r'th  processor  in  an  N1/4  sets  its  D  value  to  1  if  r  is  less 
than  or  equal  to  its  a  value.  Otherwise,  it  sets  D  to  zero. 


Figure  4.18  Computation  of  M/y  (continued  on  next  page) 
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Step  2:  (c)  The  one's  in  row  1  of  the  N314  x  N1'2  submesh  are  prefix  summed  using 
the  RMESH  ranking  algorithm  of  [JENQ91a].  Let  the  overall  sum  for  each 
group  of  Nm  a's  be  stored  in  variable  A  of  the  [1,1]  PE  of  the  N314  x  N1'4 
submesh. 

(d)  [Prefix  sum  the  A's] 

We  now  have  Nm  A  values  to  be  prefix  summed.  Each  is  in  the  range  [0, 
Nm].  The  A's  are  decomposed  using  radix  N114  such  that  A  =  c*N114  +  d 
where  0  <  d  <  N114.  The  N114  c's  may  be  prefix  summed  using  an 
N112  x  N112  submesh  by  routing  them  to  row  1  of  any  N1'2  x  N1'2  submesh, 
computing  their  unary  representation  (as  in  (b)  above),  and  ranking  (as  in  (c) 
above).  The  d's  are  prefix  summed  similarly.  By  adding  together  corresponding 
pairs  of  prefix  sums  of  c's  and  d's,  the  prefix  sums  of  the  A's  are  obtained. 

(e)  [Obtain  prefix  sum  of  a's]  The  prefix  sum  of  the  a's  is  obtained  by  adding 
together  the  prefix  sum  of  the  a's  in  each  N1'4  group  (as  computed  in  (c))  and 
appropriate  prefix  sum  of  A  (as  computed  in  (d)). 

Step  3:     [Prefix  sum  the  b's]  This  is  similar  to  Step  2. 

Step  4:  [Obtain  Mn]  M{j  is  the  sum  of  the  a  and  b  prefix  sums  computed  in  step  3  and 
step  4  for  the  y'th  (from  the  bottom)  Nl/2  x  N112  submesh  of  the  i'th 
N  x  NXI2  submesh. 


Figure  4.18  Computation  of  My  (continued  from  previous  page) 
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Tri  angulation 


In  this  section,  we  develop  an  0(1)  time  algorithm  to  triangulate  a  set  S  of  N  planar 
points.  The  algorithm  is  for  an  N  x  N  RMESH,  an  N  x  N  PARBUS  and  an  N  x  N 
MRN.  For  simplicity,  we  assume  that  the  N  points  have  distinct  ^-coordinates  and  that  no 
four  are  cocircular.  Our  algorithm  is  easily  modified  to  work  when  these  assumption  do 
not  hold.  The  overall  strategy  is  given  in  Figure  4.19.  We  begin,  in  step  1,  by  partitioning 
the  N  points  into  N213  sets,  S,-,  each  of  size  N1/3.  For  this,  we  assume  Wis  a  power  of  3. 
The  partitioning  is  done  by  x-coordinate.  Each  S;  contains  points  that  are  to  the  left  of 
those  in  5I  +  1,  1  <,  i  <  N2'3.  This  partitioning  is  possible  because  of  our  assumption 
that  the  x-coordinates  are  distinct.  To  accomplish  the  partitioning,  the  N  points  are  sorted 
by  jc-coordinate  using  the  0(1)  sorting  algorithm  of  [NIGA92]. 


Step  1:     Partition  5  into  N2'3  sets,  5/,  each  containing  N113  points.  The  partitioning  is 
done  such  that  all  points  in  S ,-  are  to  the  left  of  those  in  S ,-  + 1 , 1  £  i  <  N2'3 . 

Step  2:     Using  aniVxJV1'3  submesh  for  each  Sit  compute  the  Voronoi  diagram  for  5,-, 
1  <  i  <  N2'3. 

Step  3:     Compute  the  straight-line  dual  of  each  of  the  N213  Voronoi  diagrams  of  step  2. 

Step  4:     Partition  the  region  not  in  the  union  of  the  convex  hulls  of  the  5,'s  into  basic 
polygons  and  triangulate  these. 


Figure  4.19  Triangulating  iV  planar  points 

In  step  2,  the  Voronoi  diagram  of  each  5,-  is  computed.  For  this,  each  point/?  e  5,- 
computes  its  voronoi  polygon  which  defines  all  points  in  the  plane  that  are  closer  to  p 
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than  to  any  of  the  other  points  in  5/.  As  noted  in  [PREP85],  for  any  two  points  p  and  q, 
the  set  of  points  that  are  closer  to  p  than  to  q  is  the  half  plane  containing  p  that  is  defined 
by  the  perpendicular  bisector  of  the  line  joining  p  and  q.  The  Voronoi  polygon,  V(p), 
(i.e.,  the  polygon  whose  interior  is  the  set  of  points  closer  to  p  than  to  any  other  point  in 
Si)  is  defined  by  the  intersection  of  the  N113  -  1  half  planes  obtained  by  considering  all 
q  e  Si  -  {p}.  V(p)  is  comprised  of  portions  of  at  most  N1/3  -  1  perpendicular  bisectors. 
Consider  the  five  points  {a,  b,  c,  d,  e)  of  Figure  4.20  and  suppose  we  are  determining 
which  portion  of  the  perpendicular  bisector  aP  of  ab  is  a  boundary  of  V(a).  The  perpen- 
dicular bisector  of  ac  intersects  cc|3  at/.  Since  points  in  the  half  plane  above  the  perpen- 
dicular bisector  of  ac  are  closer  to  c  than  to  a,  only  the  portion  of  aP  with  ^-coordinate 

>  f.x  (f.x  is  the  x-coordinate  off)  may  contribute  to  V(a).  Similarly,  since  the  perpen- 
dicular bisector  of  ad  intersects  aP  at  g,  only  the  portion  of  0$  with  x-coordinate  >  g.x 
may  contribute  to  V(a).  Finally,  since  the  perpendicular  bisector  of  ae  intersects  a0  at  h 
and  since  points  below  (or  to  the  right)  of  this  bisector  are  closer  to  e  than  to  a,  only  the 
portion  of  ocp  with  x-coordinate  <  h.x  can  contribute  to  V(a).  So,  by  looking  at  the  three 
other  perpendicular  bisectors,  we  see  that  only  the  portion  of  aP  that  has  ^-coordinate 

>  mx  =  max  {f.x,  g.x}  and  <  m.n  =  h.x  is  a  boundary  of  V(a).  Note  that  if  mx  <  mn, 
then  aP  does  not  contribute  to  V(a). 

This  strategy  for  step  2  of  Figure  4.8  may  be  implemented  on  an  RMESH  or 
PARBUS.  First,  the  points  of  5,-  are  broadcast  to  all  rows  of  the  N  x  N113  submesh 
using  column  buses.  Next,  we  compute  V(p)  for  each  point/?  e  S;  using  N2/3  x  N1/3 
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Figure  4.20  Determining  portion  of  V(a)  contributed  by  ccp 
partitions  of  the  N  x  N113  partition.  The/th  such  partition  (from  the  top)  is  used  to  com- 
pute V(jj)  for  the  /th  point  p  e  S;.  For  this,  the  N213  x  N1'3  partition  is  further  parti- 
tioned into  N1'3  N1'3  x  Nu3  partitions  as  in  Figure  4.21.  The  *'th  N113  x  N1'3,  Ak,  is 
used  to  determine  the  portion  of  the  perpendicular  bisector  of  ppk  (where  pk  is  the  £'th 
point  of  Si)  that  contributes  to  V(p),  1  <  k  <  Nll3,pk  #  p.  For  this,  the  points/?  and/?* 
are  broadcast  to  all  processors  Ak.  Each  processor  then  computes  the  perpendicular 
bisector  0$  of  ppk  as  well  as  that  of  pq  where  q  is  the  point  initially  in  the  processor.  The 
processors  initially  containing  p  or  pk  are  excluded.  The  intersection  between  aP  and  the 
bisector  of  pq  is  obtained.  Let  the  ^-coordinate  of  this  intersection  be  y.  The  processor 
now  sets  itself  to  <  y  or  >  y  depending  on  which  portion  of  a|3  has  not  been  eliminated. 
In  case  a(3  is  parallel  to  the  bisector  of  pq  and  on  the  same  side  of  p,  then  the  processor 
sets  itself  to  >  °°,  if  a|3  is  farther  from  p  than  the  bisector  of  pq  and  to  <  °°  otherwise. 
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Following  this,  the  maximum  mx  of  the  >  values  and  the  minimum  mn  of  the  ^  values 
is  found  using  all  processors  of  A^.  If  mn  <  mx  then  cc(3  contributes  to  V(p)  and  the  con- 
tributing endpoints  together  with  the  associated  vertices  p  and/?*  are  saved  in  processor 
[1,1]  of  Ak. 
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Figure  4.21  N1'2  x  N1'3  partitions 

The  purpose  of  computing  the  straight  line  dual  in  step  3  is  to  obtain  the  triangula- 
tion  of  the  regions  defined  by  the  convex  hull  of  each  of  the  5,'s.  Theorem  5.11  of 
[PREP85]  states  that  the  straight  line  dual  of  the  Voronoi  diagram  of  S  is  a  triangulation 
of  S  (Strictly  speaking,  this  is  true  only  if  no  four  points  of  5  are  cocircular.  However, 
when  four  or  more  points  are  cocircular,  the  completion  of  the  triangulation  is  easy  once 
the  dual  has  been  obtained).  The  straight  line  dual  of  the  Voronoi  diagram  of  5,-  is  easy  to 
obtain.  Suppose  that  (h,v)  is  an  edge  of  this  diagram.  (u,v)  is  a  portion  of  some  perpen- 
dicular bisector.  Let  the  points  of  S;  associated  with  this  bisector  be  a  and  b.  Then,  (a,b) 
is  an  edge  of  the  straight  line  dual.  Since,  in  the  computation  of  the  Voronoi  diagram 
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(step  3  of  Figure  4.8),  we  associate  with  each  bisector  the  two  points  of  S,-  that  generate 
the  perpendicular  bisector,  the  edges  of  the  dual  are  easily  generated. 


Figure  4.22  Partitioning  region  exterior  to  convex  hulls 

When  implementing  steps  2  and  3,  it  is  best  to  combine  them  so  that  when  we  detect 
mn  <  mx  (in  step  2),  the  edge  (p,  pk)  is  generated  provided  p.x  >  Pk-X  (this  avoids  the 
generation  of  two  copies  of  each  edge  of  the  dual).  Since  the  number  of  edges  in  the  tri- 
angulation  of  S,-  is  at  most  3  N1'3  -  6,  these  edges  may  be  routed  to  the  Nm  row  1 
processors  in  O(l)  time.  Each  processor  in  row  1  gets  atmost  3  edges. 

Following  step  3,  the  regions  enclosed  by  the  convex  hulls  of  the  S,'s,  has  been  tri- 
angualted  (Figure  4.22)  and  we  need  to  triangulate  the  region  outside.  For  this,  we  use 
the  technique  of  Wang  and  Tsin  [WANGT87].  The  exterior  region  is  partitioned  into 
'simple  polygons'  (Figure  4.22)  which  are  then  triangulated  independently.  The  algo- 
rithm of  [WANGT87]  to  partition  the  external  region  is  given  in  Figure  4.22.  This  has 
been  modified  to  account  for  the  fact  that  we  have  N2/3  5,'s  while  Wang  and  Tsin  have 
onlyW1'2. 
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Step  1:     Compute  the  convex  hull  CH(i)  of  each  Sit  1  <  i  <  N2'3. 

Step  2:  For  each  5,-,  determine  the  upper,  Uyt  and  lower,  L;y,  tangents  of  S,  with 
respect  to  Sj,  1  £  j  <  i  <  N2'3. 

Step  3:  For  each  5,-,  determine  the  upper  tangent  £/,-  that  has  the  least  slope  among  all 
Uij,  j  <  i  and  the  lower  tangent  L;  that  has  maximum  slope  among  all  Li;-, 

J  <  '• 

Step  4:  For  each  5,-,  1  <  i  <  N213  determine  the  line  M,-  that  joins  the  rightmost  point 
of  CH(i  - 1 )  and  the  leftmost  point  of  CH(i). 

Step  5:  Identify  the  exterior  polygons  formed  by  the  £/,'s,  L,'s,  M,'s,  and  the 
boundaries  of  the  convex  hull  of  the  5,'s.  There  are  exacdy  2N213  -  2  such 
polygons. 


Figure  4.22  WANG's  Algorithm  [WANGT87]  to  partition  the  exterior  of  the  CH(i)  's 

into  polygons 

Step  1  is  easily  done  in  0(1)  time  using  an  N  x  Af 1/3  submesh  for  each  5,-  using  the 
N3  processor  algorithm  of  Figure  2.  Step  2  can  be  done  in  O(l)  time  using  an 
N113  x  N1'3  submesh  for  each  SjJ  <  i.  The  least  slope  Uij,j  <  i  can  be  found  by  first 
finding  the  least  slope  U l}  in  each  group  ofN113  Uij's.  This  leaves  us  with  at  most//1 
candidates  from  which  the  least  slope  one  can  be  found.  The  computation  is  confined  to 
one  N  x  N1/3  submesh  for  each  5,-.  The  leftmost  and  rightmost  points  of  5,-  are  easily 
identified  in  0(1)  time.  This  may  be  done  by  using  anN1'3  x  N1'3  submesh  for  each  Si 
and  comparing  each  point  of  5,  against  each  other  point  or  by  using  the  convex  hull  of  S,- 
and  comparing  adjacent  points  in  the  hull.  Hence,  step  4  is  also  easy  to  do  in  0(1)  time. 
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Wang  and  Tsin  [WANGT87]  have  shown  that  the  algorithm  of  Figure  4.23 
correctly  identifies  the  exterior  polygon  to  which  each  point  of  the  convex  hull  of  the  5,'s 
belongs.  In  this  algorithm,  /,•  and  r,-,  respectively,  denote  the  leftmost  and  rightmost 
points  of  CH(i);  at  and  &,  denote  the  end  points  of  V rj  with  ft,  being  a  point  of  C//(5,) 
and  a i  being  a  point  of  CH(j)  for  some;  <  i;  and  B(u)  denotes  the  exterior  polygon  to 
which  point  u  belongs.  The  portion  of  the  upper  convex  hull  of  S ;  that  is  comprised  of 
the  hull  points  from  bv  to  /,•  (counterclockwise)  defines  a  unique  exterior  polygon  (see 
Figure  4.22).  This  polygon  is  called  P ',-.  While  it  is  possible  for  a  point  u  to  belong  to 
more  than  one  P  t,  B(u)  is  defined  to  be  the  largest;  such  that  u  is  a  point  of  Pj.  Note  that 
points  on  the  lower  hull  of  the  S,'s  are  on  polygons  bounded  from  above  by  the  M,'s. 
These  are  labeled  Q's  in  Figure  4.22.  B(u)  may  be  similarly  defined  for  these  points. 

To  implement  the  algorithm  of  Figure  4.23,  it  is  best  to  identify  the  a,-,  bt,  rit  lt 
vertices  associated  with  each  5,-  during  the  computation  of  u,-  and  /,•  in  step  3  of  Figure 
4.22  and  during  the  computation  of  A/,-  in  step  4  of  Figure  4.22.  Step  1  of  Figure  4.23  is 
easily  performed  in  0(1)  time  using  an/VxiV1'3  submesh  for  each  5,-.  For  step  2,  we 
note  that  there  is  exactly  one  frj  associate  with  each  5,-,  i  >  1.  The  corresponding  a,-  can 
be  stored  along  with  it  when  U t  is  computed  in  Figure  4.22.  This  results  in  a  pair  (a,-,  bt) 
stored  in  a  single  processor.  To  compute  fi(a,),  we  need  to  accumulate  in  the  N  x  N11 
submesh  associated  with  5,-,  all  (ay,  bj)  pairs  for;  >  i.  This  can  be  done  by  having  the 
PE  that  contains  (a,-,  b{)  route  the  pair  to  the  row  i  processor  on  its  column  (using  a 
column  bus)  and  then  broadcasting  these  pairs  along  row  buses  to  the  N  x  N113 
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{Find  B(u)  for  points  in  the  upper  hull  of  the  S,'s} 

Step  1:  For  all  points  u  that  are  between  their  ft,-  and  /,•  when  moving  along  CH(i) 
counterclockwise  from  ft,-  (ft,-  excluded,  /,•  included),  set  B(u)  =  i. 

Step  2:     If  wis  an  a,-,  then  B  (u)  =  least;  such  that  a}.x  <   at.x  <   bj.x. 

Step  3:     If  u  is  a  £,-,  then  B(u)  is  computed  similar  to  the  step  2  computation  of  fl(a,). 

Step  4:     Foru  =  r,-,  setfi(u)  =  i  +  1. 

Step  5:  Let  ft,-»r,-  denote  the  segment  of  the  upper  hull  of  CH(i),  1  <  i  <  N2'2  with 
endpoints  ft,  and  r;.  Sort  bt,  rt,  and  all  a/s  in  ft,->r,  into  ascending  order  by 
^-coordinate. 

Step  6:  Let  ft,  =  a(l,  a,-2,  ...,  ah  =  r,-  be  the  sorted  sequence  ofay's  in  ft,-»r,.  For 
each  m  *  fl| ,  1  <  y  <  f  in  ft,->r,-,  set  B(w)  =  &(<>(/)  «ff«l/--i  «*  <   u.x  < 

dij.X. 

{Find  5 (u)  for  points  in  the  lower  hull  of  the  S,'s} 
[Similar  to  steps  1-6] 

Figure  4.23  Algorithm  to  determine  B(u) 
submeshes  that  need  them.  The  least  j  that  satisties  the  inequality  of  step  2  is  now  easily 
found  in  0(1)  time  ( in  each  N  x  N1/3  submesh).  Step  3  is  similar  to  step  2  and  step  4  is 
trivially  done  in  0(1)  time. 

The  segment  ft,— »r,-  of  the  upper  hull  of  CH(i)  is  easily  identified  from  CH(i)  in 
0(1)  time  for  each  S,-.  All  points  on  this  segment  that  are  ay's  have  been  identified  during 
step  3  of  Figure  4.22.  The  sort  of  step  5  can  therefore  be  done  in  0(1)  time  using  an 
N2/3  x  N1/3  submesh  of  the  N  x  N1/3  submesh  that  corresponds  to  each  5,-.  Following 
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the  sort,  each  point  of  &,->r,-  that  is  not  an  a}  can  determine  its  u  value  in  0(1)  time 
using  a  row  of  N1'3  processors.  For  this,  the  sorted  a/s  are  broadcast  down  column 
buses  of  each  N  x  N113  submesh  and  the  unique  j  satisfying  the  inequality  of  step  6  is 
found  using  row  bus  splitting.  Once  the  B(u)'s  have  been  computed,  the  convex  hull 
points  of  all  the  5,'s  are  sorted  by  their  B(u)  values.  This  brings  points  in  the  same 
polygon  next  to  each  other.  The  sort  is  accomplished  in  0(1)  time  using  the  algorithm  of 
[NIGA92].  To  triangulate  each  polygon,  it  is  desirable  to  have  the  polygon  points  ordered 
by  the  5,'s  from  which  they  came  and  within  5,-,  ordered  in  convex  hull  order.  This  can 
be  incorporated  into  the  sort  by  B(u).  Each  polygon  can  now  be  triangulated  using  an 
N  x  dt  (where  di  is  the  size  of  the  polygon)  submesh.  This  triangulation  is  fairly 
straightforward  as  each  polygon  is  comprised  of  two  monotone  segments.  The  steps  are 
given  in  [WANGT87]  for  a  PRAM  and  are  easily  performed  on  an  RMESH  or  PARBUS 
of  size  N  x  di(N  £  d4)  in  O(l)  time. 

Conclusions 


We  have  developed  constant  time  RMB  algorithms  for  the  convex  hull,  the  smallest 
enclosing  rectangle,  ECDF  searching,  and  triangulation  problems.  All  of  our  algorithms 
are  for  the  case  of  N  planar  points  and  all  work  on  N  x  N  RMBs.  The  algorithms  apply 
to  all  the  three  models  (RMESH/PARBUS/MRN). 


CHAPTER  5 
CONCLUSIONS 
Summary 
This  dissertation  was  motivated  by  the  promise  that  Reconfigurable  Mesh  with  Buses 
(RMB)  is  a  powerful  model  of  parallel  computation  compared  with  PRAM  (Shared 
Memory  Model),  existing  fixed  topology  interconnection  networks  ,  and  mesh  with  static 
buses.  The  dynamic  reconfiguration  of  buses,  using  different  switch  settings  at  a 
processor  node  during  computation   under  software  control,   allows  for  increased 
flexibility  during  computation.   There  are  large  number  of  possible  configurations  on 
RMB  compared  to  a  fixed  number  of  interconnection  patterns  stored  in  a  switch  on  a 
CHiP  [SYND82].  This  capability  is  very  useful  because  one  may  want  to  reconfigure  the 
processors  into  different  interconnection  topologies  at  various  stages  of  the  solution  of  a 
problem. 

The  variation  of  allowable  switch  setting  at  a  processor  node  results  in  different 
RMB  architectures,  RMESH,  PARBUS,  MRN.  RMESH  is  the  most  restricted  model 
which  does  not  allow  a  cross-over  connection  (such  as  N-W  and  S-E  connected  together). 
PARBUS  is  the  most  flexible  model  allowing  for  all  possible  switch  settings.  The  prob- 
lem of  simulating  an  PARBUS  on  an  RMESH  architecture  is  not  considered  here.  One 
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can  easily  simulate  anN  x  N PARBUS  on  an  3-dimensional  5N  x  5iV RMESH  with  two 
layers  by  simulating  a  node  of  PARBUS  with  a  set  of  5  x  5  RMESH  nodes  on  two 
layers.  One  can  easily  simulate  the  RMESH  switch  settings  on  an  PARBUS  architecture. 
We  concentrated  on  developing  0(1)  time  algorithms  for  many  important  problems  in 
Computer  Science  such  as  sorting,  determining  convex  hull  of  N  planar  points  etc. 

We  considered  sorting  n  numbers  onnxn  reconfigurable  meshes  with  buses.  We 
concentrated  on  developing  optimal  area* time  complexity  [THOM79,80]  (a  VLSI  com- 
plexity measure)  algorithms.  We  also  extend  our  results  to  higher  dimension  RMBs. 

We  have  developed  relatively  simple  algorithms  to  sort  n  numbers  on  n  x  n 
reconfigurable  with  buses.  For  the  case  of  the  RMESH,  our  algorithms  are  the  first  to  sort 
in  0(1)  time.  For  the  PARBUS,  our  algorithms  are  simpler  than  those  of  [JANG92a]  and 
[LIN92].  Our  PARBUS  column  sort  algorithm  is  the  fastest  of  our  algorithms  for  the 
PARBUS.  It  uses  fewer  broadcasts  than  does  the  optimized  versions  of  the  selection 
algorithm  of  [LIN92].  Our  PARBUS  algorithms  can  be  run  on  an  MRN  with  no 
modifications.  Since  n  x  n  reconfigurable  meshes  require  n2  area  for  their  layout.  Our 
algorithms  (as  well  as  those  of[JANG92]  and  [LIN92])  have  an  area  time  square  product 
AT2  of  n2  which  is  the  best  one  can  hope  for  in  view  of  the  lower  bound  result  AT2  >  n2 
for  the  VLSI  word  model  [LEIG85]. 

Using  two  dimensional  meshes  with  buses,  we  are  able  to  sort  n  elements  in  0(1) 
time  using  n2  processors.  Using  higher  dimensional  RMB,  one  can  sort  n  numbers  in 
O(l)  time  using  fewer  processors.  In  general,  n  =  Nk  numbers  can  be  sorted  in  0(1) 
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time  using  iV*  +  1  =  nl  +  1,k  processors  in  a  k  +  l  dimensional  configuration.  While  the 
same  result  has  been  shown  for  a  PARBUS  [JANG92b],  our  algorithm  applies  to  an 
RMESH  also. 

We  also  considered  the  problem  of  sorting  n  numbers  on  n1'2  x  n112  meshes  and 
RMBs.  The  nature  of  sorting  problem  (high  interprocessor  communication  in  terms  of 
data  movement)  does  not  give  any  advantage  over  meshes  without  buses.  We  considered 
this  problem  on  various  mesh  models  discussed  in  the  literature  namely  bidirectional, 
unidirectional  and  strict  unidirectional  mesh. 

We  have  shown  that  the  lower  bound  for  the  number  of  routes  needed  to  sort  on  an 
n  x  n  bidirectional  mesh  that  was  established  in  [SCHN86]  is  incorrect.  Furthermore,  we 
have  provided  algorithms  that  sort  using  fewer  routes  than  the  lower  bound  of 
[SCHN86].  In  fact,  the  algorithm  of  chapter  3  is  able  to  sort  using  a  number  of  routes  that 
is  very  close  to  the  distance  lower  bound  for  bidirectional  as  well  as  strict  unidirectional 
meshes. 

The  fold/sort/unfold  algorithms  developed  in  chapter  3  have  practical  implications 
to  sorting  on  a  mesh.  The  advantages  of  this  technique  when  applied  to  computers  with 
tR  »  2  tc  were  pointed  out  in  chapter  4.  We  also  showed  how  to  simulate  the  mesh 
algorithms  on  reconfigurable  meshes  with  buses. 

We  also  considered  computational  geometry  problems  on  N  x  N  reconfigurable 
meshes  with  buses  and  gave  0(1)  time  algorithms  to  determine  the  convex  hull  of  a  set  of 


N  planar  points,  find  the  smallest  enclosing  rectangle  of  a  set  of  N  planar  points,  to  tri- 
angulate a  set  of  N  planar  points  and  for  the  ECDF  search  problem  for  N  planar  points. 

We  showed  that  Reisis's  [REIS92]  algorithm  for  determining  the  convex  hull  of  N 
planar  points  on  RMESH  is  incorrect.  Jang's  [JANG92e]  algorithms  are  given  for  N  x  N 
PARBUS  and  will  not  work  on  an  N  x  N  RMESH  in  0(1)  time.  We  developed  an  0(1) 
time  algorithm  to  determine  the  voronoi  diagram  of  N  planar  points  on  N4  processors  as 
a  part  of  our  algorithm  to  triangulate  N  planar  points.  Our  algorithms  work  on  all 
Reconfigurable  Meshes  with  Buses  architectures  (RMESH/PARBUS/MRN). 

Future  work 


The  work  described  in  this  dissertation  may  be  extended  to  investigate  following 
interesting  problems. 

1.  Is  there  an  0(1)  time  algorithm  for  determining  a  Voronoi  diagram  of  N  planar  points 
onN  x  N RMBs  ? 

2.  Is  there  an  0(1)  time  algorithm  for  determining  rectilinear  Voronoi  diagram  of  N 
planar  points  on  N  x  N  RMBs  ? 

3.  Is  there  an  0(1)  simulation  of  an  N  x  W  PARBUS  on  a  N  x  N  RMESH  ?  If  yes,  is  this 
true  for  higher  dimensions  ? 

4.  Is  there  an  0(1)  time  image  connected  component  algorithm  forNxN  pixels  on  an 
W x  NRMB? 
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